add limited metadata caching to journald and other journal improvements #6392

Merged
poettering merged 13 commits into systemd:master on Jul 31, 2017

Owner

poettering commented Jul 17, 2017

No description provided.

@poettering poettering added the journal label Jul 17, 2017

@@ -200,3 +200,10 @@ DEFINE_TRIVIAL_CLEANUP_FUNC(char *, string_free_erase);
#define _cleanup_string_free_erase_ _cleanup_(string_free_erasep)
bool string_is_safe(const char *p) _pure_;
+
+static inline size_t strlen_ptr(const char *s) {
@vcaputo

vcaputo Jul 19, 2017

Member

this seems quite unrelated to the topic of this commit

@keszybz

keszybz Jul 25, 2017

Owner

Seems OK to me.
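
For readers without the full diff: the hunk above is cut off right after the function signature. Going by the commit description ("it handles NULL strings in a smart way"), a helper of this kind might look like the sketch below. The name `strlen_or_zero` and the body are illustrative assumptions, not a quote from the patch.

```c
#include <stddef.h>
#include <string.h>

/* Hypothetical stand-in for the helper under review: behaves like strlen(),
 * but treats a NULL pointer as an empty string instead of dereferencing it. */
static inline size_t strlen_or_zero(const char *s) {
        if (!s)
                return 0;

        return strlen(s);
}
```

Callers that store optional strings can then compute lengths without a separate NULL check at every call site.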

src/journal/journald-context.c
+
+/* This implements a metadata cache for clients, which are identified by their PID. Requesting metadata through /proc
+ * is expensive, hence let's cache the data if we can. Note that this means the metadata might be out-of-date when we
+ * store it, but it might already be anyway, as we request the data asynchronously from /proc at a different time the
@vcaputo

vcaputo Jul 19, 2017

Member

nit: the current situation is actually the opposite of out of date; the metadata is sampled in the future relative to when the log was produced.

@keszybz

keszybz Jul 25, 2017

Owner

Yeah, that's complicated. We might actually have accurate metadata more often after this patch, e.g. if a client logs twice, and exits immediately after, and we manage to cache metadata after the first log entry.
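
The scenario described here (a second log line served from metadata cached while handling the first) can be illustrated with a toy PID-keyed cache. This is only a sketch under stated assumptions: a diff excerpt further down shows the patch reshuffling a priority queue (`prioq_reshuffle()` on `client_contexts_lru`), whereas this stand-in uses a fixed array, a linear scan and a logical clock. The names `meta_cache`, `cached_meta`, `meta_cache_get` and `CACHE_SLOTS` are hypothetical, and the `n_fetches` counter merely stands in for the expensive /proc reads.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <sys/types.h>

#define CACHE_SLOTS 4

/* One cached client; a real entry would also carry comm, cgroup, unit, ... */
struct cached_meta {
        pid_t pid;            /* 0 marks an unused slot */
        uint64_t last_used;   /* logical clock value of the most recent lookup */
};

struct meta_cache {
        struct cached_meta slots[CACHE_SLOTS];
        uint64_t clock;       /* incremented on every lookup */
        unsigned n_fetches;   /* how often the expensive path was taken */
};

/* Return the entry for pid. On a miss, the least recently used slot is
 * recycled (unused slots have last_used == 0, so they are taken first). */
static struct cached_meta *meta_cache_get(struct meta_cache *mc, pid_t pid) {
        struct cached_meta *victim;
        size_t i;

        assert(mc);
        assert(pid > 0);

        victim = &mc->slots[0];
        mc->clock++;

        for (i = 0; i < CACHE_SLOTS; i++) {
                struct cached_meta *m = &mc->slots[i];

                if (m->pid == pid) {
                        /* Hit: no expensive lookup, just mark as recently used */
                        m->last_used = mc->clock;
                        return m;
                }

                if (m->last_used < victim->last_used)
                        victim = m;
        }

        /* Miss: this is where the metadata would be read from /proc */
        mc->n_fetches++;
        victim->pid = pid;
        victim->last_used = mc->clock;
        return victim;
}
```

With a zero-initialized `struct meta_cache`, two consecutive lookups for the same PID take the expensive path only once, which is the effect the benchmark in the commit message relies on.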

src/journal/journald-context.c
+ if (!c)
+ return -ENOMEM;
+
+ c->n_ref = 0;
@vcaputo

vcaputo Jul 19, 2017

Member

redundant to new0()

src/journal/journald-context.c
+ else
+ (void) get_process_gid(c->pid, &c->gid);
+
+ return 0;
@vcaputo

vcaputo Jul 19, 2017

Member

why isn't this a void function if the errors are all being ignored?

src/journal/journald-context.c
+ assert(c);
+ assert(pid_is_valid(c->pid));
+
+ if (get_process_comm(c->pid, &t) >= 0) {
@vcaputo

vcaputo Jul 19, 2017

Member

Why not just use the free_and_replace() macro from basic/alloc-util.h? These can all lose the {}s even then.
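
For context, the suggestion is to replace an open-coded free-then-assign with a single macro. A minimal sketch follows; `FREE_AND_REPLACE` is a local stand-in written for this example (the real macro in basic/alloc-util.h may differ in detail), and `struct entry`, `copy_string` and `entry_set_comm` are hypothetical.

```c
#include <stdlib.h>
#include <string.h>

/* Stand-in macro: free the old value of `a`, move `b` into it, and clear `b`
 * so that the allocation has exactly one owner afterwards. */
#define FREE_AND_REPLACE(a, b)  \
        do {                    \
                free(a);        \
                (a) = (b);      \
                (b) = NULL;     \
        } while (0)

/* Hypothetical cache entry with a single heap-allocated field. */
struct entry {
        char *comm;
};

/* Heap-allocate a copy of a NUL-terminated string; NULL on allocation failure. */
static char *copy_string(const char *s) {
        size_t n = strlen(s) + 1;
        char *t = malloc(n);

        if (t)
                memcpy(t, s, n);
        return t;
}

/* Replace the cached name. On allocation failure the old value is kept.
 * Returns 0 on success, -1 on failure. */
static int entry_set_comm(struct entry *e, const char *name) {
        char *t = copy_string(name);

        if (!t)
                return -1;

        FREE_AND_REPLACE(e->comm, t);
        return 0;
}
```

Because the macro clears the temporary, a cleanup handler attached to `t` would not free the value that now belongs to the entry.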

src/journal/journald-context.c
+ c->capeff = t;
+ }
+
+ return 0;
@vcaputo

vcaputo Jul 19, 2017

Member

again; why not a void return?

src/journal/journald-context.c
+ if (!l)
+ return -ENOMEM;
+
+ memcpy(l, label, label_size);
@vcaputo

vcaputo Jul 19, 2017

Member

is there really not a convenience helper in systemd for doing this null-terminated memcpy?

src/journal/journald-context.c
+ memcpy(l, label, label_size);
+ l[label_size] = 0;
+
+ free(c->label);
@vcaputo

vcaputo Jul 19, 2017

Member

free_and_replace()

src/journal/journald-context.c
+ /* If we got no SELinux label passed in, let's try to acquire one */
+
+ if (getpidcon(c->pid, &con) >= 0) {
+ free(c->label);
@vcaputo

vcaputo Jul 19, 2017

Member

free_and_replace()

src/journal/journald-context.c
+ return 0;
+ }
+
+ free(c->cgroup);
@vcaputo

vcaputo Jul 19, 2017

Member

lots of free_and_replace() in this function, I'll stop pointing them out.

src/journal/journald-context.c
+ assert_se(prioq_reshuffle(s->client_contexts_lru, c, &c->lru_index) >= 0);
+ }
+
+ return 0;
@vcaputo

vcaputo Jul 19, 2017

Member

why not void?

src/journal/journald-context.c
+ if (label_size > 0 && (label_size != c->label_size || memcmp(label, c->label, label_size) != 0))
+ goto refresh;
+
+ return 0;
@vcaputo

vcaputo Jul 19, 2017

Member

another void return no? these all imply errors are propagated and that's simply not the case

src/journal/journald-context.c
+ return client_context_really_refresh(s, c, ucred, label, label_size, unit_id, timestamp);
+}
+
+static void client_context_make_room(Server *s, size_t limit) {
@vcaputo

vcaputo Jul 19, 2017

Member

This function name makes me expect the argument to be how much room to make.

I think it would be better named something like client_context_try_shrink_to() or something.

src/journal/journald-context.c
+ if (add_ref)
+ c->n_ref = 1;
+ else {
+ c->n_ref = 0;
@vcaputo

vcaputo Jul 19, 2017

Member

n_ref is already 0 from client_context_new()

src/journal/journald-context.c
+
+ }
+
+ return 0;
@vcaputo

vcaputo Jul 19, 2017

Member

why not a void return?

Member

vcaputo commented Jul 19, 2017

Neat, something in this vein is long overdue.

I just did a pretty casual review, mostly nits, otherwise 👍

It's good you addressed all the tiny allocation/frees for the various metadata fields. When I started reviewing I was concerned you wouldn't address that aspect while adding the cache. Your alloca approach is less invasive than mine was, though I avoided all copies. I think I like yours more.

Owner

poettering commented Jul 20, 2017

Thanks for the review! I have now force pushed a new version with almost all of your points fixed. I did leave some functions returning "int", even though the caller ignores it then. It just feels weird to eat obvious OOM issues right away in the callee, it felt more natural to leave this to the caller. I mean, ultimately it doesn't really matter anyway, the compiler should optimize all this away easily as this stuff is all static, non-exported stuff...

Anyway, I hope that makes some sense. Please have another look so that we can get this landed!

vcaputo approved these changes Jul 20, 2017 edited

You have my approval, FWIW. Note I've only looked at the last commit...

Some initial comments. I didn't review the main patch yet.

src/basic/alloc-util.h
+
+ /* The same as memdup() but place a safety NUL byte after the allocated memory */
+
+ q = memdup(p, l+1);
@keszybz

keszybz Jul 25, 2017

Owner

This doesn't look right. Based on the commit description, this could be used to copy bytes from a fixed-size buffer, right up to the edge. Then this memdup will read one byte past the allowed area. This must be replaced by malloc + memcpy.
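
A version that never reads past the source buffer, along the lines of the malloc + memcpy approach asked for here, could look like the following. `dup_with_nul` is a hypothetical name, and the overflow check is an addition of this sketch rather than something taken from the patch.

```c
#include <stdint.h>
#include <stdlib.h>
#include <string.h>

/* Copy exactly `l` bytes from `p` into a new allocation of l + 1 bytes and
 * terminate it with a NUL byte. Only p[0..l-1] is ever read, so `p` may end
 * right at the edge of a fixed-size buffer. Returns NULL on failure. */
static void *dup_with_nul(const void *p, size_t l) {
        char *q;

        if (l == SIZE_MAX)      /* l + 1 would overflow */
                return NULL;

        q = malloc(l + 1);
        if (!q)
                return NULL;

        if (l > 0)
                memcpy(q, p, l);

        q[l] = 0;
        return q;
}
```

The difference from the hunk above is that the extra byte is written by the callee instead of being copied from the source.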

src/basic/process-util.c
@@ -388,7 +388,7 @@ int is_kernel_thread(pid_t pid) {
bool eof;
FILE *f;
- if (pid == 0 || pid == 1) /* pid 1, and we ourselves certainly aren't a kernel thread */
+ if (pid == 0 || pid == 1 || pid == getpid()) /* pid 1, and we ourselves certainly aren't a kernel thread */
@keszybz

keszybz Jul 25, 2017

Owner

Shouldn't this be getpid_cached()? Without that, this additional check might slow things down.

@poettering

poettering Jul 31, 2017

Owner

yupp, this PR predates the getpid_cached() PR, hence it's based on a version without it. Will rebase.

src/journal/journald-context.c
+#include <selinux/selinux.h>
+#endif
+
+#include <assert.h>
@keszybz

keszybz Jul 25, 2017

Owner

Don't include assert.h. We redefine assert independently.

src/journal/journald-context.c
+#include "user-util.h"
+#include "audit-util.h"
+#include "string-util.h"
+#include "cgroup-util.h"
@keszybz

keszybz Jul 25, 2017

Owner

Any particular reason to have those unsorted?

src/journal/journald-rate-limit.c
+ /* Returns:
+ *
+ * 0 → the log message shall be suppressed,
+ * 1 + n → if the log message shall be permitted, and n messages where dropped from the peer before
@keszybz

keszybz Jul 25, 2017

Owner

if, and where → were
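
The return-value convention quoted in the hunk (0 means suppress, 1 + n means permit with n earlier messages dropped) can be unpacked as in the sketch below. `struct rate_verdict` and `decode_rate_limit` are hypothetical names; negative values are outside the quoted convention, so this sketch simply rejects them with an assertion.

```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical decoded form of the return value described in the comment. */
struct rate_verdict {
        bool permitted;        /* false: the message shall be suppressed */
        unsigned n_dropped;    /* messages dropped earlier; 0 unless permitted */
};

static struct rate_verdict decode_rate_limit(int rl) {
        struct rate_verdict v = { false, 0 };

        assert(rl >= 0);       /* only the documented values are handled here */

        if (rl == 0)
                return v;      /* 0: suppress */

        v.permitted = true;    /* 1 + n: permit, n messages were dropped before */
        v.n_dropped = (unsigned) rl - 1;
        return v;
}
```

This is why the caller shown in journald-server.c only writes a "Suppressed" message when the value is greater than 1, and reports `rl - 1` as the count.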

src/journal/journald-server.c
+ /* Write a suppression message if we suppressed something */
+ if (rl > 1)
+ server_driver_message(s, "MESSAGE_ID=" SD_MESSAGE_JOURNAL_DROPPED_STR,
+ LOG_MESSAGE("Suppressed %u messages from %s", rl - 1, path),
@glasser

glasser Jul 25, 2017

Contributor

Won't path always be NULL here?

+
+ /* If that didn't work, we use the unit ID passed in as fallback, if we have nothing cached yet */
+ if (unit_id && !c->unit) {
+ c->unit = strdup(unit_id);
@keszybz

keszybz Jul 26, 2017

Owner

Hmm, is it on purpose that in client_context_read_label free_and_replace is used, and here just normal assignment (no free)?

@poettering

poettering Jul 31, 2017

Owner

yes, it is... we trust the data from /proc more, and only use the data from the peer if we have nothing better. Hence we only set c->unit if it is NULL so far, as the comment is supposed to clarify.

@keszybz

keszybz Jul 31, 2017

Owner

Ah, right.
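
The precedence rule explained in this exchange (data derived from /proc always wins, the peer-supplied unit ID is used only when nothing is cached yet) can be sketched as two setters. All names here (`struct client_ctx`, `ctx_set_unit_trusted`, `ctx_set_unit_fallback`, `copy_string`) are hypothetical and chosen for the example only.

```c
#include <stdlib.h>
#include <string.h>

/* Hypothetical, heavily reduced client context. */
struct client_ctx {
        char *unit;
};

/* Heap-allocate a copy of a NUL-terminated string; NULL on allocation failure. */
static char *copy_string(const char *s) {
        size_t n = strlen(s) + 1;
        char *t = malloc(n);

        if (t)
                memcpy(t, s, n);
        return t;
}

/* Trusted source: always replaces whatever is cached. */
static int ctx_set_unit_trusted(struct client_ctx *c, const char *unit) {
        char *t = copy_string(unit);

        if (!t)
                return -1;

        free(c->unit);
        c->unit = t;
        return 0;
}

/* Untrusted fallback: only fills the field if nothing is cached yet, so a
 * plain assignment without free() is sufficient. */
static int ctx_set_unit_fallback(struct client_ctx *c, const char *unit) {
        if (!unit || c->unit)
                return 0;

        c->unit = copy_string(unit);
        return c->unit ? 0 : -1;
}
```

The missing free() in the fallback path is therefore not a leak: the branch is only entered when the field is still NULL.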

src/journal/journald-context.c
+ if (cg_path_get_session(c->cgroup, &t) >= 0)
+ free_and_replace(c->session, t);
+ else
+ c->session = mfree(c->session);
@keszybz

keszybz Jul 26, 2017

Owner

Maybe it's too much magic, but lines 276–279 can be replaced with

(void) cg_path_get_session(c->cgroup, &t);
free_and_replace(c->session, t);
+ ClientContext **ret) {
+
+ return client_context_get_internal(s, pid, ucred, label, label_len, unit_id, true, ret);
+};
@keszybz

keszybz Jul 26, 2017

Owner

Shouldn't those be static inline functions? Or are we counting on lto to figure things out for us?

@poettering

poettering Jul 31, 2017

Owner

client_context_get_internal() is a static function, and gcc should be smart enough to optimize this away for us, and at least turn this into JMP rather than CALL, which I am very sure is good enough

src/journal/journald-context.c
+
+ if (!s->my_context) {
+ struct ucred ucred = {
+ .pid = getpid(),
@keszybz

keszybz Jul 26, 2017

Owner

getpid_cached()?

src/journal/journald-server.c
static void dispatch_message_real(
Server *s,
struct iovec *iovec, unsigned n, unsigned m,
- const struct ucred *ucred,
+ ClientContext *c,

poettering added some commits Jul 14, 2017

escape: fix systemd-escape description text
The long man page paragraph got it right: the tool is for escaping systemd unit
names, not just system unit names. Also fix the short man page paragraph
and the --help text.

Follow-up for 303608c
audit: introduce audit_session_is_valid() and make use of it everywhere
Let's add a proper validation function, since validation isn't entirely
trivial. Make use of it where applicable. Also make use of
AUDIT_SESSION_INVALID where we need a marker for an invalid audit
session.
parse-util: introduce pid_is_valid()
Checking for validity of a PID is relatively easy, but let's add a
helper call for this too, in order to make things more readable and more
similar to uid_is_valid(), gid_is_valid() and friends.
execute: make some code shorter
Let's simplify some lines to make it shorter.
execute: don't pass unit ID in --user mode to journald for stream logging

When we create a log stream connection to journald, we pass along the
unit ID. With this change we do this only when we run as system
instance, not as user instance, to remove the ambiguity whether a user
or system unit is specified. The effect of this change is minor:
journald ignores the field anyway from clients with UID != 0. This patch
hence only fixes the unit attribution for the --user instance of the
root user.
journald: add comment explaining journal rate limit return codes
This is not obvious, hence let's add a comment.
journald: only accept valid unit names for log streams
Let's be a bit stricter in what we end up logging: ignore invalid unit
name specifications. Let's validate all input!

As we ignore unit names passed in from unprivileged clients anyway the
effect of this additional check is minimal.

(Also, no need to initialize the identifier/unit_id fields of stream
objects to NULL if empty strings are passed, the default is NULL
anyway...)
process-util: slightly optimize querying of our own process metadata
When we are checking our own data, we can optimize things a bit.
string-util: add strlen_ptr() helper
strlen_ptr() is to strlen() what streq_ptr() is to streq(): i.e. it
handles NULL strings in a smart way.
alloc-util: add new helpers memdup_suffix0() and newdup_suffix0()
These are similar to memdup() and newdup(), but reserve one extra NUL
byte at the end of the new allocation and initialize it. It's useful
when copying out data from fixed size character arrays where NUL
termination can't be assumed.
string-util: optimize strshorten() a bit
There's no reason to determine the full length of the string, it's
sufficient to know whether it is larger than the intended size...
journald: add minimal client metadata caching
Cache client metadata, in order to improve runtime behaviour under
pressure.

This is inspired by @vcaputo's work, specifically:

#2280

That code implements related but different semantics.

For a longer explanation of what this change implements please have a look
at the long source comment this patch adds to journald-context.c.

After this commit:

        # time bash -c 'dd bs=$((1024*1024)) count=$((1*1024)) if=/dev/urandom | systemd-cat'
        1024+0 records in
        1024+0 records out
        1073741824 bytes (1.1 GB, 1.0 GiB) copied, 11.2783 s, 95.2 MB/s

        real	0m11.283s
        user	0m0.007s
        sys	0m6.216s

Before this commit:

        # time bash -c 'dd bs=$((1024*1024)) count=$((1*1024)) if=/dev/urandom | systemd-cat'
        1024+0 records in
        1024+0 records out
        1073741824 bytes (1.1 GB, 1.0 GiB) copied, 52.0788 s, 20.6 MB/s

        real	0m52.099s
        user	0m0.014s
        sys	0m7.170s

As a side effect, this corrects the journal's rate limiter feature: we now
always use the unit name as key for the ratelimiter.
Owner

poettering commented Jul 31, 2017

force pushed a new version, please have a look. addressed all issues raised

+ /* Write a suppression message if we suppressed something */
+ if (rl > 1)
+ server_driver_message(s, "MESSAGE_ID=" SD_MESSAGE_JOURNAL_DROPPED_STR,
+ LOG_MESSAGE("Suppressed %u messages from %s", rl - 1, c->unit),
@glasser

glasser Jul 31, 2017

Contributor

I actually have a script which parses this message and looks at the path (so eg, right now it is looking for /system.slice/docker.service here). Can you confirm that after this change I should be looking for docker.service instead? That's the understanding I get from reading this PR but I don't actually know how to build and test systemd :)

@poettering

poettering Jul 31, 2017

Owner

i figure we should add _OBJECT_UNIT_NAME= to this line to make it easy to match against this log message in a structured way. But that should probably happen in a later commit

@poettering

poettering Jul 31, 2017

Owner

and yeah, you should check the unit name, not the cgroup path, as we do take liberty that the path might change

@keszybz

keszybz Jul 31, 2017

Owner

It'd be nice to add SYSTEMD_UNIT= field too to that log message, so it shows up in 'journalctl -u' output, but that's not really related to this PR.

@glasser

glasser Jul 31, 2017

Contributor

yeah, that would certainly be nicer than us having to run journalctl -f _SYSTEMD_UNIT=docker.service + MESSAGE_ID=a596d6fe7bfa4994828e72309e95d61e and then filter out based on parsing a string but I agree this is probably separate ;)

@poettering

poettering Jul 31, 2017

Owner

I filed #6494 as an RFE bug to add such a hook-up

@poettering poettering merged commit 6b43d07 into systemd:master Jul 31, 2017

4 checks passed

semaphoreci The build passed on Semaphore.
xenial-amd64 autopkgtest finished (success)
xenial-i386 autopkgtest finished (success)
xenial-s390x autopkgtest finished (success)