Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

add limited metadata caching to journald and other journal improvements #6392

Merged
merged 13 commits into from Jul 31, 2017

Conversation

4 participants
@poettering
Copy link
Member

commented Jul 17, 2017

No description provided.

@poettering poettering added the journal label Jul 17, 2017

@poettering poettering force-pushed the poettering:journal-cache branch 2 times, most recently from 6615f40 to 91d4231 Jul 17, 2017

@@ -200,3 +200,10 @@ DEFINE_TRIVIAL_CLEANUP_FUNC(char *, string_free_erase);
#define _cleanup_string_free_erase_ _cleanup_(string_free_erasep)

bool string_is_safe(const char *p) _pure_;

static inline size_t strlen_ptr(const char *s) {

This comment has been minimized.

Copy link
@vcaputo

vcaputo Jul 19, 2017

Member

this seems quite unrelated to the topic of this commit

This comment has been minimized.

Copy link
@keszybz

keszybz Jul 25, 2017

Member

Seems OK to me.


/* This implements a metadata cache for clients, which are identified by their PID. Requesting metadata through /proc
* is expensive, hence let's cache the data if we can. Note that this means the metadata might be out-of-date when we
* store it, but it might already be anyway, as we request the data asynchronously from /proc at a different time the

This comment has been minimized.

Copy link
@vcaputo

vcaputo Jul 19, 2017

Member

nit: the current situation is actually the opposite of out of date; the metadata is sampled in the future relative to when the log was produced.

This comment has been minimized.

Copy link
@keszybz

keszybz Jul 25, 2017

Member

Yeah, that's complicated. We might actually have accurate metadata more often after this patch, e.g. if a client logs twice, and exits immediately after, and we manage to cache metadata after the first log entry.

if (!c)
return -ENOMEM;

c->n_ref = 0;

This comment has been minimized.

Copy link
@vcaputo

vcaputo Jul 19, 2017

Member

redundant to new0()

else
(void) get_process_gid(c->pid, &c->gid);

return 0;

This comment has been minimized.

Copy link
@vcaputo

vcaputo Jul 19, 2017

Member

why isn't this a void function if the errors are all being ignored?

assert(c);
assert(pid_is_valid(c->pid));

if (get_process_comm(c->pid, &t) >= 0) {

This comment has been minimized.

Copy link
@vcaputo

vcaputo Jul 19, 2017

Member

Why not just use the free_and_replace() macro from basic/alloc-util.h? These can all lose the {}s even then.

c->capeff = t;
}

return 0;

This comment has been minimized.

Copy link
@vcaputo

vcaputo Jul 19, 2017

Member

again; why not a void return?

if (!l)
return -ENOMEM;

memcpy(l, label, label_size);

This comment has been minimized.

Copy link
@vcaputo

vcaputo Jul 19, 2017

Member

is there really not a convenience helper in systemd for doing this null-terminated memcpy?

memcpy(l, label, label_size);
l[label_size] = 0;

free(c->label);

This comment has been minimized.

Copy link
@vcaputo

vcaputo Jul 19, 2017

Member

free_and_replace()

/* If we got no SELinux label passed in, let's try to acquire one */

if (getpidcon(c->pid, &con) >= 0) {
free(c->label);

This comment has been minimized.

Copy link
@vcaputo

vcaputo Jul 19, 2017

Member

free_and_replace()

return 0;
}

free(c->cgroup);

This comment has been minimized.

Copy link
@vcaputo

vcaputo Jul 19, 2017

Member

lots of free_and_replace() in this function, I'll stop pointing them out.

assert_se(prioq_reshuffle(s->client_contexts_lru, c, &c->lru_index) >= 0);
}

return 0;

This comment has been minimized.

Copy link
@vcaputo

vcaputo Jul 19, 2017

Member

why not void?

if (label_size > 0 && (label_size != c->label_size || memcmp(label, c->label, label_size) != 0))
goto refresh;

return 0;

This comment has been minimized.

Copy link
@vcaputo

vcaputo Jul 19, 2017

Member

another void return no? these all imply errors are propagated and that's simply not the case

return client_context_really_refresh(s, c, ucred, label, label_size, unit_id, timestamp);
}

static void client_context_make_room(Server *s, size_t limit) {

This comment has been minimized.

Copy link
@vcaputo

vcaputo Jul 19, 2017

Member

This function name makes me expect the argument to be how much room to make.

I think it would be better named something like client_context_try_shrink_to() or something.

if (add_ref)
c->n_ref = 1;
else {
c->n_ref = 0;

This comment has been minimized.

Copy link
@vcaputo

vcaputo Jul 19, 2017

Member

n_ref is already 0 from client_context_new()


}

return 0;

This comment has been minimized.

Copy link
@vcaputo

vcaputo Jul 19, 2017

Member

why not a void return?

@vcaputo

This comment has been minimized.

Copy link
Member

commented Jul 19, 2017

Neat, something in this vein is long overdue.

I just did a pretty casual review, mostly nits, otherwise 👍

It's good you addressed all the tiny allocation/frees for the various metadata fields. When I started reviewing I was concerned you wouldn't address that aspect while adding the cache. Your alloca approach is less invasive than mine was, though I avoided all copies. I think I like yours more.

@poettering

This comment has been minimized.

Copy link
Member Author

commented Jul 20, 2017

Thanks for the review! I have now force pushed a new version with almost all of your points fixed. I did leave some functions returning "int", even though the caller ignores it then. It just feels weird to eat obvious OOM issues right away in the callee, it felt more natural to leave this to the caller. I mean, ultimately it doesn't really matter anyway, the compiler should optimize all this away easily as this stuff is all static, non-exported stuff...

Anway, I hope that makes some sense. Please have another look so that we can get this landed!

@poettering poettering force-pushed the poettering:journal-cache branch 2 times, most recently from 5116a46 to 4f7f51b Jul 20, 2017

@vcaputo
Copy link
Member

left a comment

You have my approval, FWIW. Note I've only looked at the last commit...

@keszybz
Copy link
Member

left a comment

Some initial comments. i didn't review the main patch yet.

/* Returns:
*
* 0 → the log message shall be suppressed,
* 1 + n → if the log message shall be permitted, and n messages where dropped from the peer before

This comment has been minimized.

Copy link
@keszybz

keszybz Jul 25, 2017

Member

if, and where → were

@@ -388,7 +388,7 @@ int is_kernel_thread(pid_t pid) {
bool eof;
FILE *f;

if (pid == 0 || pid == 1) /* pid 1, and we ourselves certainly aren't a kernel thread */
if (pid == 0 || pid == 1 || pid == getpid()) /* pid 1, and we ourselves certainly aren't a kernel thread */

This comment has been minimized.

Copy link
@keszybz

keszybz Jul 25, 2017

Member

Shouldn't this be getpid_cached()? Without that, this additional check might slow things down.

This comment has been minimized.

Copy link
@poettering

poettering Jul 31, 2017

Author Member

yupp, this PR predates the getpid_cached() PR, hence it's based on a version without it. Will rebase.

@@ -200,3 +200,10 @@ DEFINE_TRIVIAL_CLEANUP_FUNC(char *, string_free_erase);
#define _cleanup_string_free_erase_ _cleanup_(string_free_erasep)

bool string_is_safe(const char *p) _pure_;

static inline size_t strlen_ptr(const char *s) {

This comment has been minimized.

Copy link
@keszybz

keszybz Jul 25, 2017

Member

Seems OK to me.


/* The same as memdup() but place a safety NUL byte after the allocated memory */

q = memdup(p, l+1);

This comment has been minimized.

Copy link
@keszybz

keszybz Jul 25, 2017

Member

This doesn't look right. Based on the commit description, this could be used to bytes from a fixed size buffer, right to the edge. Then this memdup will read one past the allowed area. This must be replaced by malloc + memcpy.

This comment has been minimized.

Copy link
@poettering

poettering Jul 31, 2017

Author Member

ouch!

#include <selinux/selinux.h>
#endif

#include <assert.h>

This comment has been minimized.

Copy link
@keszybz

keszybz Jul 25, 2017

Member

Don't include assert.h. We redefine assert independently.

#include "user-util.h"
#include "audit-util.h"
#include "string-util.h"
#include "cgroup-util.h"

This comment has been minimized.

Copy link
@keszybz

keszybz Jul 25, 2017

Member

Any particular reason to have those unsorted?

This comment has been minimized.

Copy link
@poettering

poettering Jul 31, 2017

Author Member

nope


/* This implements a metadata cache for clients, which are identified by their PID. Requesting metadata through /proc
* is expensive, hence let's cache the data if we can. Note that this means the metadata might be out-of-date when we
* store it, but it might already be anyway, as we request the data asynchronously from /proc at a different time the

This comment has been minimized.

Copy link
@keszybz

keszybz Jul 25, 2017

Member

Yeah, that's complicated. We might actually have accurate metadata more often after this patch, e.g. if a client logs twice, and exits immediately after, and we manage to cache metadata after the first log entry.

/* Write a suppression message if we suppressed something */
if (rl > 1)
server_driver_message(s, "MESSAGE_ID=" SD_MESSAGE_JOURNAL_DROPPED_STR,
LOG_MESSAGE("Suppressed %u messages from %s", rl - 1, path),

This comment has been minimized.

Copy link
@glasser

glasser Jul 25, 2017

Contributor

Won't path always be NULL here?

This comment has been minimized.

Copy link
@poettering

poettering Jul 31, 2017

Author Member

indeed!


/* If that didn't work, we use the unit ID passed in as fallback, if we have nothing cached yet */
if (unit_id && !c->unit) {
c->unit = strdup(unit_id);

This comment has been minimized.

Copy link
@keszybz

keszybz Jul 26, 2017

Member

Hmm, is it on purpose that in client_context_read_label free_and_replace is used, and here just normal assignment (no free)?

This comment has been minimized.

Copy link
@poettering

poettering Jul 31, 2017

Author Member

yes, it is... we trust the data from /proc more, and only use the data from the peer if we have nothing better. Hence we only set c->unit if it is NULL so far, as the comment is supposed to clarify.

This comment has been minimized.

Copy link
@keszybz

keszybz Jul 31, 2017

Member

Ah, right.

if (cg_path_get_session(c->cgroup, &t) >= 0)
free_and_replace(c->session, t);
else
c->session = mfree(c->session);

This comment has been minimized.

Copy link
@keszybz

keszybz Jul 26, 2017

Member

Maybe it's too much magic, but lines 276–279 can be replaced with

(void) cg_path_get_session(c->cgroup, &t);
free_and_replace(c->session, t);
ClientContext **ret) {

return client_context_get_internal(s, pid, ucred, label, label_len, unit_id, true, ret);
};

This comment has been minimized.

Copy link
@keszybz

keszybz Jul 26, 2017

Member

Shouldn't those be static inline functions? Or are we counting on lto to figure things out for us?

This comment has been minimized.

Copy link
@poettering

poettering Jul 31, 2017

Author Member

client_context_get_internal() is a static function, and gcc should be smart enough to optimize this away for us, and at least turn this into JMP rather than CALL, which I am very sure is good enough


if (!s->my_context) {
struct ucred ucred = {
.pid = getpid(),

This comment has been minimized.

Copy link
@keszybz

keszybz Jul 26, 2017

Member

getpid_cached()?


static void dispatch_message_real(
Server *s,
struct iovec *iovec, unsigned n, unsigned m,
const struct ucred *ucred,
ClientContext *c,

This comment has been minimized.

Copy link
@keszybz

keszybz Jul 26, 2017

Member

const?

escape: fix systemd-escape description text
The long man page paragraph got it right: the tool is for escaping systemd unit
names, not just system unit names. Also fix the short man page paragraph
and the --help text.

Follow-up for 303608c

poettering added some commits Jul 14, 2017

audit: introduce audit_session_is_valid() and make use of it everywhere
Let's add a proper validation function, since validation isn't entirely
trivial. Make use of it where applicable. Also make use of
AUDIT_SESSION_INVALID where we need a marker for an invalid audit
session.
parse-util: introduce pid_is_valid()
Checking for validity of a PID is relatively easy, but let's add a
helper cal for this too, in order to make things more readable and more
similar to uid_is_valid(), gid_is_valid() and friends.
execute: make some code shorter
Let's simplify some lines to make it shorter.
execute: don't pass unit ID in --user mode to journald for stream log…
…ging

When we create a log stream connection to journald, we pass along the
unit ID. With this change we do this only when we run as system
instance, not as user instance, to remove the ambiguity whether a user
or system unit is specified. The effect of this change is minor:
journald ignores the field anyway from clients with UID != 0. This patch
hence only fixes the unit attribution for the --user instance of the
root user.
journald: add comment explaining journal rate limit return codes
This is not obvious, hence let's add a comment.
journald: only accept valid unit names for log streams
Let's be a bit stricter in what we end up logging: ignore invalid unit
name specifications. Let's validate all input!

As we ignore unit names passed in from unprivileged clients anyway the
effect of this additional check is minimal.

(Also, no need to initialize the identifier/unit_id fields of stream
objects to NULL if empty strings are passed, the default is NULL
anyway...)
process-util: slightly optimize querying of our own process metadata
When we are checking our own data, we can optimize things a bit.
string-util: add strlen_ptr() helper
strlen_ptr() is to strlen() what streq_ptr() is to streq(): i.e. it
handles NULL strings in a smart way.
alloc-util: add new helpers memdup_suffix0() and newdup_suffix0()
These are similar to memdup() and newdup(), but reserve one extra NUL
byte at the end of the new allocation and initialize it. It's useful
when copying out data from fixed size character arrays where NUL
termination can't be assumed.
string-util: optimize strshorten() a bit
There's no reason to determine the full length of the string, it's
sufficient to know whether it is larger than the intended size...
journald: add minimal client metadata caching
Cache client metadata, in order to be improve runtime behaviour under
pressure.

This is inspired by @vcaputo's work, specifically:

#2280

That code implements related but different semantics.

For a longer explanation what this change implements please have a look
at the long source comment this patch adds to journald-context.c.

After this commit:

        # time bash -c 'dd bs=$((1024*1024)) count=$((1*1024)) if=/dev/urandom | systemd-cat'
        1024+0 records in
        1024+0 records out
        1073741824 bytes (1.1 GB, 1.0 GiB) copied, 11.2783 s, 95.2 MB/s

        real	0m11.283s
        user	0m0.007s
        sys	0m6.216s

Before this commit:

        # time bash -c 'dd bs=$((1024*1024)) count=$((1*1024)) if=/dev/urandom | systemd-cat'
        1024+0 records in
        1024+0 records out
        1073741824 bytes (1.1 GB, 1.0 GiB) copied, 52.0788 s, 20.6 MB/s

        real	0m52.099s
        user	0m0.014s
        sys	0m7.170s

As side effect, this corrects the journal's rate limiter feature: we now
always use the unit name as key for the ratelimiter.

@poettering poettering force-pushed the poettering:journal-cache branch from 4f7f51b to 22e3a02 Jul 31, 2017

@poettering

This comment has been minimized.

Copy link
Member Author

commented Jul 31, 2017

force pushed a new version, please have a look. addressed all issues raised

/* Write a suppression message if we suppressed something */
if (rl > 1)
server_driver_message(s, "MESSAGE_ID=" SD_MESSAGE_JOURNAL_DROPPED_STR,
LOG_MESSAGE("Suppressed %u messages from %s", rl - 1, c->unit),

This comment has been minimized.

Copy link
@glasser

glasser Jul 31, 2017

Contributor

I actually have a script which parses this message and looks at the path (so eg, right now it is looking for /system.slice/docker.service here). Can you confirm that after this change I should be looking for docker.service instead? That's the understanding I get from reading this PR but I don't actually know how to build and test systemd :)

This comment has been minimized.

Copy link
@poettering

poettering Jul 31, 2017

Author Member

i figure we should add _OBJECT_UNIT_NAME= to this line to make it easy to match against this log message in a structured way. But that should probably happen in a later commit

This comment has been minimized.

Copy link
@poettering

poettering Jul 31, 2017

Author Member

and yeah, you should check the unit name, not the cgroup path, as we do take liberty that the path might change

This comment has been minimized.

Copy link
@keszybz

This comment has been minimized.

Copy link
@keszybz

keszybz Jul 31, 2017

Member

It'd be nice to add SYSTEMD_UNIT= field too to that log message, so it shows up in 'journalctl -u' output, but that's not really related to this PR.

This comment has been minimized.

Copy link
@glasser

glasser Jul 31, 2017

Contributor

yeah, that would certainly be nicer than us having to run journalctl -f _SYSTEMD_UNIT=docker.service + MESSAGE_ID=a596d6fe7bfa4994828e72309e95d61e and then filter out based on parsing a string but I agree this is probably separate ;)

This comment has been minimized.

Copy link
@poettering

poettering Jul 31, 2017

Author Member

I filed #6494 as an RFE bug to add such a hook-up

@poettering poettering merged commit 6b43d07 into systemd:master Jul 31, 2017

4 checks passed

semaphoreci The build passed on Semaphore.
Details
xenial-amd64 autopkgtest finished (success)
Details
xenial-i386 autopkgtest finished (success)
Details
xenial-s390x autopkgtest finished (success)
Details

@vcaputo vcaputo referenced this pull request Aug 27, 2017

Closed

systemd-journal uses too much CPU #5102

1 of 2 tasks complete

@josh-mesos josh-mesos referenced this pull request Jul 20, 2018

Merged

Improved install docs for 1.11 #1192

0 of 3 tasks complete
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
You can’t perform that action at this time.