
Journalctl: 13x faster by caching fstat() and optimizing timestamps retrieval #29261

Closed
wants to merge 5 commits

Conversation

ktsaou
Contributor

@ktsaou ktsaou commented Sep 20, 2023

I noticed that journalctl is quite slow when there are several million log lines available to be queried.

To emulate some load I used this on my development machine:

journalctl --no-pager --since "60 days ago"

This gives a little more than 4.7 million log entries (765 MiB of output), and the current main branch needs a little more than 123 seconds to complete.

I compiled it with this:

rm -rf build; meson --buildtype=debugoptimized build . && ninja -C build version.h && ninja -C build libsystemd && ninja -C build journalctl

and ran it through a profiler, getting this:

[image: profiler output]

Note the big yellow lines for fstatat. So, it seems that although my system has just a handful of journal files, journalctl repeatedly runs fstat() on them.

click to see my `/var/log/journal`
# tree /var/log/journal
/var/log/journal
├── 929c259028424a2da82cc8cd09e0bca1
│   ├── system@000604b194adb5b4-432725c4395f2ced.journal~
│   ├── system@000604b29619ad8f-41625a1016b4fa3f.journal~
│   ├── system@000605c820ed5e4e-0e2195376e8833a6.journal~
│   ├── system@000605c89e3ac856-97082da1c64c70bd.journal~
│   ├── system@000938c3307c48fe805e8f1503db53cd-00000000017f07e1-000605c9ec86601e.journal
│   ├── system@000938c3307c48fe805e8f1503db53cd-00000000017f0e44-000605c9ecbdd4a2.journal
│   ├── system@3a9c2394c05f486fbd9443cdfe1780b5-00000000017f5dde-000605ca7d212d36.journal
│   ├── system@63bb8f3469bc4275b6047431486909ea-000000000137bdab-0006041a0bf1092c.journal
│   ├── system@63bb8f3469bc4275b6047431486909ea-0000000001385605-0006043229dbdde8.journal
│   ├── system@63bb8f3469bc4275b6047431486909ea-0000000001385cab-000604322a0ca39d.journal
│   ├── system@63bb8f3469bc4275b6047431486909ea-00000000013a35b6-0006047546c513b6.journal
│   ├── system@63bb8f3469bc4275b6047431486909ea-00000000013c117f-000604b194ad7d1e.journal
│   ├── system@63bb8f3469bc4275b6047431486909ea-00000000013c3ef4-000604b296197720.journal
│   ├── system@63bb8f3469bc4275b6047431486909ea-00000000013ca6da-000604c36d065146.journal
│   ├── system@63bb8f3469bc4275b6047431486909ea-00000000013cb281-000604c36dad7ecf.journal
│   ├── system@63bb8f3469bc4275b6047431486909ea-00000000013eb995-0006050089b2e51f.journal
│   ├── system@63bb8f3469bc4275b6047431486909ea-0000000001407ff2-00060539c8321794.journal
│   ├── system@63bb8f3469bc4275b6047431486909ea-00000000014086d0-00060539c86103d4.journal
│   ├── system@63bb8f3469bc4275b6047431486909ea-000000000142d399-0006055d75b014c8.journal
│   ├── system@63bb8f3469bc4275b6047431486909ea-00000000014558c9-00060567760d0535.journal
│   ├── system@63bb8f3469bc4275b6047431486909ea-000000000147f273-00060570073e46ee.journal
│   ├── system@63bb8f3469bc4275b6047431486909ea-00000000014a83b9-000605760d2c26ba.journal
│   ├── system@63bb8f3469bc4275b6047431486909ea-00000000014d0262-0006057fb0cfed1a.journal
│   ├── system@63bb8f3469bc4275b6047431486909ea-00000000014f9aed-000605835126e06f.journal
│   ├── system@63bb8f3469bc4275b6047431486909ea-0000000001523597-0006058699943758.journal
│   ├── system@63bb8f3469bc4275b6047431486909ea-000000000154d0e3-00060589b9c62615.journal
│   ├── system@63bb8f3469bc4275b6047431486909ea-0000000001576c3f-0006058cc51a3d60.journal
│   ├── system@63bb8f3469bc4275b6047431486909ea-00000000015a07ff-0006058f7589d1cb.journal
│   ├── system@63bb8f3469bc4275b6047431486909ea-00000000015ca4fb-00060591f7c690be.journal
│   ├── system@63bb8f3469bc4275b6047431486909ea-00000000015f4295-00060594622db439.journal
│   ├── system@63bb8f3469bc4275b6047431486909ea-000000000161dfc9-0006059693f390ed.journal
│   ├── system@63bb8f3469bc4275b6047431486909ea-0000000001647dbe-00060598a1f55c31.journal
│   ├── system@63bb8f3469bc4275b6047431486909ea-0000000001671c1b-0006059aa11305a7.journal
│   ├── system@63bb8f3469bc4275b6047431486909ea-000000000169ba56-0006059c92c625e9.journal
│   ├── system@63bb8f3469bc4275b6047431486909ea-00000000016c6792-000605a40540f5a0.journal
│   ├── system@63bb8f3469bc4275b6047431486909ea-00000000016eeee7-000605ac30c09239.journal
│   ├── system@63bb8f3469bc4275b6047431486909ea-000000000171850e-000605b0b80f228d.journal
│   ├── system@63bb8f3469bc4275b6047431486909ea-00000000017413d5-000605b6176e33b1.journal
│   ├── system@63bb8f3469bc4275b6047431486909ea-0000000001769a94-000605bbc6f9d29b.journal
│   ├── system@63bb8f3469bc4275b6047431486909ea-0000000001792d8a-000605c139b941cf.journal
│   ├── system@67d8a2003bb847e0bf61434bfc9efc48-00000000017da029-000605c83724ebad.journal
│   ├── system@67d8a2003bb847e0bf61434bfc9efc48-00000000017da767-000605c837597e5a.journal
│   ├── system@839368cfceef4c58b7134b01deac2654-00000000017e05ac-000605c89601394e.journal
│   ├── system@839368cfceef4c58b7134b01deac2654-00000000017e0c96-000605c896322abb.journal
│   ├── system@83f233550ffb4274ba3c19e81986198a-00000000017d4e9a-000605c820ed296a.journal
│   ├── system@b13ced2012b6441bbbbaad080611fdbd-00000000017ede54-000605c98c353cc5.journal
│   ├── system@b13ced2012b6441bbbbaad080611fdbd-00000000017ee4a4-000605c98c68acac.journal
│   ├── system@ce2abdea1f954e4ebecdaa4f35ec4ad2-00000000017e6827-000605c92539822d.journal
│   ├── system@ce2abdea1f954e4ebecdaa4f35ec4ad2-00000000017e6fd8-000605c9256f59f8.journal
│   ├── system@db4726d85842415b932cc3811c064b5a-00000000017df0d9-000605c8926b42a1.journal
│   ├── system@db4726d85842415b932cc3811c064b5a-00000000017df817-000605c8929b665b.journal
│   ├── system@e5806de69bae41668771ad3cb7d27443-00000000017e1f96-000605c89e3a8fab.journal
│   ├── system@e5806de69bae41668771ad3cb7d27443-00000000017e26b3-000605c89e69bc93.journal
│   ├── system.journal
│   ├── user-1001@000604b18099c1eb-0a6766d61185ee96.journal~
│   ├── user-1001@000604b194d0bbb6-381638a635ac22c5.journal~
│   ├── user-1001@000604b28e8f26b5-f8c6939712f9e86d.journal~
│   ├── user-1001@000604b2a0d6ff56-2dd3405342cb0375.journal~
│   ├── user-1001@000605c8211d7d03-75f549522714e133.journal~
│   ├── user-1001@000938c3307c48fe805e8f1503db53cd-00000000017f0e43-000605c9ecba7eb5.journal
│   ├── user-1001@67d8a2003bb847e0bf61434bfc9efc48-00000000017da766-000605c8375683b8.journal
│   ├── user-1001@839368cfceef4c58b7134b01deac2654-00000000017e0c95-000605c8963039b7.journal
│   ├── user-1001@83f233550ffb4274ba3c19e81986198a-00000000017d55e7-000605c8211d7a33.journal
│   ├── user-1001@99ce63d7aa024482a2d09efd2ae124bb-000000000137bdaa-0006041a0bed0da9.journal
│   ├── user-1001@99ce63d7aa024482a2d09efd2ae124bb-0000000001385caa-000604322a04b6db.journal
│   ├── user-1001@99ce63d7aa024482a2d09efd2ae124bb-00000000013c4b40-000604b2a0d6fda8.journal
│   ├── user-1001@99ce63d7aa024482a2d09efd2ae124bb-00000000013cb280-000604c36dab247e.journal
│   ├── user-1001@99ce63d7aa024482a2d09efd2ae124bb-00000000013eb994-0006050089af26cb.journal
│   ├── user-1001@99ce63d7aa024482a2d09efd2ae124bb-00000000014086cf-00060539c8558450.journal
│   ├── user-1001@99ce63d7aa024482a2d09efd2ae124bb-000000000142d398-0006055d75ac39c2.journal
│   ├── user-1001@99ce63d7aa024482a2d09efd2ae124bb-00000000014558c8-0006056776093fbf.journal
│   ├── user-1001@99ce63d7aa024482a2d09efd2ae124bb-000000000147f272-00060570073aab1e.journal
│   ├── user-1001@99ce63d7aa024482a2d09efd2ae124bb-00000000014a83b8-000605760d2892ca.journal
│   ├── user-1001@99ce63d7aa024482a2d09efd2ae124bb-00000000014d0261-0006057fb0cc6d4a.journal
│   ├── user-1001@99ce63d7aa024482a2d09efd2ae124bb-00000000014f9aec-000605835123b98a.journal
│   ├── user-1001@99ce63d7aa024482a2d09efd2ae124bb-0000000001523596-000605869990e084.journal
│   ├── user-1001@99ce63d7aa024482a2d09efd2ae124bb-000000000154d0e2-00060589b9c2a320.journal
│   ├── user-1001@99ce63d7aa024482a2d09efd2ae124bb-0000000001576c3e-0006058cc516ff32.journal
│   ├── user-1001@99ce63d7aa024482a2d09efd2ae124bb-00000000015a07fe-0006058f758696b5.journal
│   ├── user-1001@99ce63d7aa024482a2d09efd2ae124bb-00000000015ca4fa-00060591f7c355e5.journal
│   ├── user-1001@99ce63d7aa024482a2d09efd2ae124bb-00000000015f4294-00060594622a5997.journal
│   ├── user-1001@99ce63d7aa024482a2d09efd2ae124bb-000000000161dfc8-0006059693f03440.journal
│   ├── user-1001@99ce63d7aa024482a2d09efd2ae124bb-0000000001647dbd-00060598a1f1f93e.journal
│   ├── user-1001@99ce63d7aa024482a2d09efd2ae124bb-0000000001671c1a-0006059aa10fa18c.journal
│   ├── user-1001@99ce63d7aa024482a2d09efd2ae124bb-000000000169ba55-0006059c92c2da9c.journal
│   ├── user-1001@99ce63d7aa024482a2d09efd2ae124bb-00000000016c6791-000605a4053c8ef3.journal
│   ├── user-1001@99ce63d7aa024482a2d09efd2ae124bb-00000000016eeee6-000605ac30bce604.journal
│   ├── user-1001@99ce63d7aa024482a2d09efd2ae124bb-000000000171850d-000605b0b80bf167.journal
│   ├── user-1001@99ce63d7aa024482a2d09efd2ae124bb-00000000017413d4-000605b6176aaeb1.journal
│   ├── user-1001@99ce63d7aa024482a2d09efd2ae124bb-0000000001769a93-000605bbc6f61ff4.journal
│   ├── user-1001@99ce63d7aa024482a2d09efd2ae124bb-0000000001792d89-000605c139b5c24a.journal
│   ├── user-1001@b13ced2012b6441bbbbaad080611fdbd-00000000017ee4a3-000605c98c64ace0.journal
│   ├── user-1001@ce2abdea1f954e4ebecdaa4f35ec4ad2-00000000017e6fd7-000605c9256ce2c0.journal
│   ├── user-1001@db4726d85842415b932cc3811c064b5a-00000000017df816-000605c892981c8a.journal
│   ├── user-1001@e5806de69bae41668771ad3cb7d27443-00000000017e26b2-000605c89e67dd9d.journal
│   ├── user-1001.journal
│   ├── user-1002@67d8a2003bb847e0bf61434bfc9efc48-00000000017dae06-000605c8396da9fa.journal
│   ├── user-1002@83f233550ffb4274ba3c19e81986198a-00000000017d709b-000605c824370233.journal
│   ├── user-1002@b13ced2012b6441bbbbaad080611fdbd-00000000017eea65-000605c98d489d57.journal
│   └── user-1002@ce2abdea1f954e4ebecdaa4f35ec4ad2-00000000017e7f41-000605c926fa1e58.journal
└── remote

3 directories, 100 files

OPTIMIZATION 1

I noticed that compare_boot_ids() is responsible for about 40% of execution time.

In this PR I added a simple hashtable that maps each boot_id to its newest (machine_id, realtime_usec). The hashtable preallocates 1000 pointers in sd_journal (8 KiB on 64-bit machines), and each slot holds a linked list of struct boot_id_newest_entry. For the hash I used something trivial: (boot_id.qwords[0] | boot_id.qwords[1]) % BOOT_ID_TO_NEWEST_HASHTABLE_SIZE.

Of course, the memory allocated is freed when sd_journal is closed.
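For illustration, here is a minimal sketch of what such a cache could look like. It uses the public sd_id128_t type (whose qwords[] member the hash above refers to); the struct and constant names follow the description rather than the actual patch:

```c
#include <stddef.h>
#include <stdint.h>
#include <systemd/sd-id128.h>

typedef uint64_t usec_t;   /* mirrors systemd's internal time-util.h */

#define BOOT_ID_TO_NEWEST_HASHTABLE_SIZE 1000

/* One entry per boot id encountered during the query. */
struct boot_id_newest_entry {
        sd_id128_t boot_id;                  /* key */
        sd_id128_t machine_id;               /* machine id of the newest entry seen */
        usec_t newest_realtime_usec;         /* newest realtime timestamp for this boot */
        struct boot_id_newest_entry *next;   /* collision chain hanging off the slot */
};

/* Trivial hash, exactly as described above: combine the two 64-bit halves of
 * the boot id and take the modulo of the table size. */
static size_t boot_id_hash(sd_id128_t boot_id) {
        return (size_t) ((boot_id.qwords[0] | boot_id.qwords[1]) % BOOT_ID_TO_NEWEST_HASHTABLE_SIZE);
}
```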

OPTIMIZATION 2

The next problem was journal_file_read_tail_timestamp(). It calls journal_file_fstat() every time it is invoked, which results in a huge number of fstat() calls.

The solution was simple. JournalFile already has the last_stat_usec member, and since now(CLOCK_MONOTONIC) is already called on every invocation, I used 2 new members to cache the last fstat() result for up to 1 second. If more than 1 second has passed since the file was last fstat()ed, the call is made again.
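A self-contained sketch of that idea follows; the struct and function names here are only illustrative, the actual patch works directly on JournalFile's existing last_stat/last_stat_usec members:

```c
#include <errno.h>
#include <stdint.h>
#include <sys/stat.h>
#include <time.h>

typedef uint64_t usec_t;                     /* mirrors systemd's time-util.h */
#define USEC_PER_SEC ((usec_t) 1000000)

/* Hypothetical per-file state; in the PR this lives inside JournalFile. */
struct cached_file {
        int fd;
        struct stat last_stat;
        usec_t last_stat_usec;               /* CLOCK_MONOTONIC time of the last real fstat() */
};

static usec_t now_monotonic_usec(void) {
        struct timespec ts;
        clock_gettime(CLOCK_MONOTONIC, &ts);
        return (usec_t) ts.tv_sec * USEC_PER_SEC + (usec_t) ts.tv_nsec / 1000;
}

/* Re-issue fstat() only if the cached result is older than 1 second. */
static int cached_fstat(struct cached_file *f) {
        usec_t n = now_monotonic_usec();

        if (f->last_stat_usec > 0 && n - f->last_stat_usec < 1 * USEC_PER_SEC)
                return 0;                    /* cached stat data is still fresh */

        if (fstat(f->fd, &f->last_stat) < 0)
                return -errno;

        f->last_stat_usec = n;
        return 0;
}
```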

IMPROVEMENT

From 123 seconds, the time is now reduced to 29 seconds, more than 4 times faster. I would say a very good improvement for just a little extra memory.

[image: profiler output after the changes]

Of course I also checked the output of both runs; they are identical, line by line.

Note that __clock_gettime still accounts for about 2 of the 29 seconds, which is really a lot of clock retrieval requests. These calls exist in the current main branch too, but are hidden because fstat() dominates.

I guess a better solution (which would also avoid these clock calls) would be to use inotify to receive updates about the directories of interest, avoiding all these fstat() calls altogether.

The biggest changes in this PR are:

  1. The latest realtime_usec per boot id is cached for the lifetime of sd_journal.
  2. The files are fstat()ed at most once per second.
  3. A few more KiB are needed to cache the above (on 64-bit: 8 KiB always, plus 56 bytes per boot id encountered during the query).

I have not dug very deep into your code, so feel free to close this and implement it your own way.

To my understanding, 29 seconds for 4.7 million entries is still a lot. Yes, I know the number of operations is tremendous. At Netdata we implemented on-the-fly indexing for systemd-journal logs, using the public APIs, to produce something like this (note that it has counters for every label value, so it traverses all available fields):

[image: Netdata systemd-journal plugin screenshot]

In the final graph, I see that sd_journal_enumerate_data() is called twice: once by get_display_timestamp() and once by output_short(), for about 6 seconds each (actually, finding the timestamp is more expensive than finding the MESSAGE). Merging the 2 calls may require some refactoring of the public APIs, but I think it probably justifies the effort if the time can be further reduced by 20-30%.

@github-actions github-actions bot added journal please-review PR is ready for (re-)review by a maintainer labels Sep 20, 2023
@yuwata
Member

yuwata commented Sep 21, 2023

https://lwn.net/Articles/944214/
If the issue explained in the article is resolved, the performance should be slightly improved.

@mrc0mmand mrc0mmand added the ci-fails/needs-rework 🔥 Please rework this, the CI noticed an issue with the PR label Sep 21, 2023
@github-actions github-actions bot removed the ci-fails/needs-rework 🔥 Please rework this, the CI noticed an issue with the PR label Sep 21, 2023
@ktsaou
Contributor Author

ktsaou commented Sep 21, 2023

Instead of running in fstat() caching mode all the time, I added the public call sd_journal_enable_fast_query(), which enables the fstat() caching; the original behavior is preserved when it is not used. In this mode the fstat() result is cached for the lifetime of the sd_journal (fstat() is called once and never updated again).

I was hoping that this would solve the unit-test problem too.

However, it seems that public functions need something more in order for CI to complete.

BTW, with the latest update the time went down to 26 seconds, because even the clock is not queried all the time.

I also tested the same query with the Netdata plugin. It needs 16 seconds to traverse the same data and all the available fields, while indexing all field values on the fly.

[image: profiler chart of the Netdata plugin run]

In the above chart, the right half is libsystemd and the left half is our indexing library.

The difference is tremendous (15 vs 26 seconds). So, I understand journalctl can be optimized further.

@yuwata yuwata added the ci-fails/needs-rework 🔥 Please rework this, the CI noticed an issue with the PR label Sep 21, 2023
@github-actions github-actions bot added util-lib journal-remote and removed ci-fails/needs-rework 🔥 Please rework this, the CI noticed an issue with the PR labels Sep 23, 2023
@ktsaou
Contributor Author

ktsaou commented Sep 23, 2023

I pushed a change to remove the previous call I added (sd_journal_enable_fast_query()) and replace it with the flag SD_JOURNAL_FAST_QUERY.
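For callers, usage would presumably look like the sketch below. This is hypothetical: the SD_JOURNAL_FAST_QUERY flag only exists when building against this PR, and the helper name is made up.

```c
#include <systemd/sd-journal.h>

/* Hypothetical caller-side usage of the proposed SD_JOURNAL_FAST_QUERY flag;
 * the flag is introduced by this PR and is not part of the released
 * libsystemd API. */
static int open_for_fast_query(sd_journal **ret) {
        return sd_journal_open(ret, SD_JOURNAL_LOCAL_ONLY | SD_JOURNAL_FAST_QUERY);
}
```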

@ktsaou
Contributor Author

ktsaou commented Sep 23, 2023

The delta from 15 to 26 seconds is caused by get_display_timestamp(). Its only user is show_journal_entry(), which is responsible for generating the journalctl output.

get_display_timestamp() traverses the log entry fields to find _SOURCE_REALTIME_TIMESTAMP and _SOURCE_MONOTONIC_TIMESTAMP, so that journalctl prints the actual time of the log entry and not the timestamp at which the log was received.

However, show_journal_entry() calls get_display_timestamp() before the output functions are called, which means the entry's fields are traversed with JOURNAL_FOREACH_DATA twice: once to find the source timestamps and a second time by the output function.

Ideally, the logic of finding the timestamps should be integrated into the loops the output functions do, so that the fields are traversed only once.

This is important: it accounts for about 42% of the total time journalctl needs after all the optimizations in this PR.
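As a rough sketch of the single-pass idea, here is how it could look with the public API (journalctl itself uses the internal JOURNAL_FOREACH_DATA loop, so the real change looks different; names here are illustrative):

```c
#include <string.h>
#include <systemd/sd-journal.h>

/* Pick up both the source timestamp and the MESSAGE while enumerating the
 * entry's fields once, instead of walking them twice. */
static void collect_entry(sd_journal *j) {
        const void *data;
        size_t length;
        const char *message = NULL, *source_realtime = NULL;
        size_t message_len = 0, source_realtime_len = 0;

        SD_JOURNAL_FOREACH_DATA(j, data, length) {
                if (length >= 8 && memcmp(data, "MESSAGE=", 8) == 0) {
                        message = (const char *) data + 8;
                        message_len = length - 8;
                } else if (length >= 27 && memcmp(data, "_SOURCE_REALTIME_TIMESTAMP=", 27) == 0) {
                        source_realtime = (const char *) data + 27;
                        source_realtime_len = length - 27;
                }
        }

        /* ... format the timestamp and print the message here ... */
        (void) message; (void) message_len;
        (void) source_realtime; (void) source_realtime_len;
}
```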

@ktsaou
Contributor Author

ktsaou commented Sep 23, 2023

I changed output_short() to do the work get_display_timestamp() does, while traversing the fields it needs.

I have not changed the other 4 output formats, so output_json, output_cat, output_verbose and output_export benefit from the fstat() caching but still call get_display_timestamp().

New Comparison (new dataset)

Current master

The current main branch needs 371 seconds (6+ minutes):

[image: profiler output, current main]

This PR

The new code needs 28.29 seconds, 13.3 times faster!

[image: profiler output, this PR]

The output of the current main branch and this PR is exactly the same, line-by-line.

Now the performance of journalctl matches the performance of Netdata using libsystemd, for the same dataset.

@ktsaou ktsaou changed the title Journalctl: do not fstat() too much Journalctl: 13x faster by caching fstat() and optimizing timestamps retrieval Sep 23, 2023
@ktsaou
Contributor Author

ktsaou commented Sep 23, 2023

I see 2 additional problems that may hurt performance:

  1. A lot of mmap() and munmap() calls. You have a cache, but either it is too small (I see 64 windows), or it does not evict cached windows in an LRU fashion (it evicts the last_used one; is that really the LRU, or the opposite?). Still, 3 seconds for mmap/munmap seems like a lot.
  2. parse_field() can probably be improved. I see it runs memcmp() for every field loaded against every field in the static list, so the memcmp() is effectively multiplied by a factor of 14 (the static field list has 14 entries, matched against all the fields in the database row). This is another 2 seconds. I would experiment with either a direct hash comparison (add the hash to the static list, compute the hash of every incoming field and compare hashes first), or, even better, a simple hashtable; a rough sketch of the hash idea follows below.
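Something along these lines; the FNV-1a hash and the names are illustrative only, not what systemd actually uses:

```c
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Precompute a hash for each entry of the static field list, hash each
 * incoming field name once, and only memcmp() on a hash match. */
static uint64_t fnv1a(const char *s, size_t len) {
        uint64_t h = 0xcbf29ce484222325ULL;
        for (size_t i = 0; i < len; i++) {
                h ^= (uint8_t) s[i];
                h *= 0x100000001b3ULL;
        }
        return h;
}

struct static_field {
        const char *name;
        size_t len;
        uint64_t hash;     /* precomputed once at startup */
};

static const struct static_field *match_field(
                const struct static_field *fields, size_t n_fields,
                const char *name, size_t name_len) {

        uint64_t h = fnv1a(name, name_len);   /* hash the incoming field once */

        for (size_t i = 0; i < n_fields; i++)
                if (fields[i].hash == h &&
                    fields[i].len == name_len &&
                    memcmp(fields[i].name, name, name_len) == 0)
                        return &fields[i];

        return NULL;
}
```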

Anyway, this is already much better. I tried to respect your coding style as much as possible. Unfortunately, I don't have time to work on this further, but journalctl needs some more love, as it affects everyone out there. I hope I helped a little.

@mrc0mmand mrc0mmand added the ci-fails/needs-rework 🔥 Please rework this, the CI noticed an issue with the PR label Sep 23, 2023
@poettering
Member

So I am not opposed to adding an fstat() optimization (and maybe the others, but can we please do this in separate PRs).

Also, the fstat() optimization should really be done by default, and maybe be something people can opt out from via an env var. That's how we usually do things: make the default efficient, and keep tunables we'd probably not want to put front and center somewhat hidden behind an env var.

(inotify generally does not suffice, since network file systems and such do not support it)
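A minimal sketch of the env-var opt-out pattern described above. The variable name is purely hypothetical, and systemd would likely use its internal getenv_bool() helper rather than this hand-rolled check:

```c
#include <stdbool.h>
#include <stdlib.h>
#include <string.h>

/* Default: caching is on; setting the (hypothetical) variable
 * $SYSTEMD_JOURNAL_DISABLE_FSTAT_CACHE to "1", "yes" or "true" switches it off. */
static bool fstat_cache_enabled(void) {
        const char *e = getenv("SYSTEMD_JOURNAL_DISABLE_FSTAT_CACHE");

        if (!e)
                return true;

        return !(strcmp(e, "1") == 0 || strcmp(e, "yes") == 0 || strcmp(e, "true") == 0);
}
```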

@@ -692,25 +692,39 @@ static int journal_file_verify_header(JournalFile *f) {
}

int journal_file_fstat(JournalFile *f) {
usec_t now_ut = now(CLOCK_MONOTONIC);

if(f->cache_fstat.done && now_ut - f->last_stat_usec < 1 * USEC_PER_SEC)
Member


space between if and `(`

@@ -697,10 +697,12 @@ int journal_file_fstat(JournalFile *f) {
assert(f);
assert(f->fd >= 0);

if (fstat(f->fd, &f->last_stat) < 0)
return -errno;
if(!f->last_stat_usec || !f->cache_fstat) {
Member


space between if and (, i.e. follow our usual coding style

Member


we generally don't rely on C's downgrade-to-bool feature for numerical fields, like last_stat_usec. Please do an explicit comparison for that, i.e. use f->last_stat_usec == 0 || …

@poettering
Member

btw, we do use inotify already for detecting changes. the fstat cache should be invalidated whenever we see an inotify change.

@ktsaou
Contributor Author

ktsaou commented Sep 28, 2023

@poettering I have added two fstat()-related optimizations:

  1. caching the timestamp of the latest message of each boot id involved in the query. This is not just fstat() caching.
  2. caching the fstat() call made to get the latest timestamp in a journal file.

The key problem here is "accuracy vs speed".

The way the code is currently in main, it is 100% accurate, meaning that even if things change during a query, the changes will be taken into account mid-query.

With the caching I added, these timestamps are assumed constant for the lifetime of the query.

When a user is executing a query to get some data, they obviously want fast results, and even if things change mid-query, it is OK to present the output as it was at the time the query started.

But when a user gets an sd_journal * for some work, I think it is wrong to assume things will not change for the lifetime of this pointer. They may hold a very long-lived sd_journal * that is reused for many queries over time. For this use case, caching will give wrong results.

Of course, since most users just want to do a fast query, we could flip the logic: do fast queries by default and provide an option for increased accuracy for the other use cases. But this changes the semantics of the current behavior.

A better approach would be to cache by default, but flush the caches when a reset/restart is requested, or even introduce a call that lets callers flush the caches when they believe they should.

Generally, I think we should carefully consider changing the default...

@ktsaou
Contributor Author

ktsaou commented Sep 28, 2023

Created 2 new PRs with everything included here.

Closing this in favor of the others.

@ktsaou ktsaou closed this Sep 28, 2023
@github-actions github-actions bot removed please-review PR is ready for (re-)review by a maintainer ci-fails/needs-rework 🔥 Please rework this, the CI noticed an issue with the PR squash-on-merge labels Sep 28, 2023