Journalctl: 13x faster by caching fstat() and optimizing timestamps retrieval #29261
Conversation
https://lwn.net/Articles/944214/
I was hoping that this would solve the unit tests problem too. However, it seems that public functions need something more in order for CI to complete. BTW, with the latest update the time went down to 26 seconds, because even the clock is not queried all the time. I also tested the same query with the netdata plugin. It needs 16 seconds to traverse the same data and all the available fields, with on-the-fly indexing of all field values. In the above chart, the right half is libsystemd, the left is our indexing library. The difference is tremendous (15 vs 26 seconds). So, I understand
I pushed a change to remove the previous call I added.
The delta from 15 to 26 seconds is because of
However, ideally the logic of finding the timestamps should be integrated into the loops the output functions do, so that the fields are traversed only once. This is important: it accounts for about 42% of the total time.
I have not changed the other 4 output formats.

**New Comparison (new dataset)**

**Current master**

The current main branch needs 371 seconds (6+ minutes).

**This PR**

The new code needs 28.29 seconds, 13.3 times faster!

The output of the current main branch and this PR is exactly the same, line by line.
I see 2 additional problems that may hurt performance:
Anyway, this is already much better. I tried to respect your coding style as much as possible. Unfortunately, I don't have time to work on this more.
So I am not opposed to adding an fstat optimization (and maybe the others, but can we please do this in separate PRs). Also, the fstat optimization should really be done by default, and maybe something people can opt out from via an env var. That's how we usually do things: make the default efficient, and make tunables we'd probably not want to put front and center somewhat hidden via an env var. (inotify generally does not suffice, since network file systems and such do not support it)
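The opt-out wiring suggested here is simple to sketch. Note that the variable name below is purely illustrative — no such tunable exists in systemd:

```c
/* Sketch of the env-var opt-out idea: caching stays on by default, and is
 * disabled only when the user explicitly asks for it.
 * SYSTEMD_JOURNAL_NO_FSTAT_CACHE is a hypothetical name for illustration. */
#include <assert.h>
#include <stdbool.h>
#include <stdlib.h>
#include <string.h>

static bool fstat_cache_enabled(void) {
        const char *e = getenv("SYSTEMD_JOURNAL_NO_FSTAT_CACHE");

        /* Default is the efficient behavior; setting "1" opts out. */
        return !(e && strcmp(e, "1") == 0);
}
```

This follows the "efficient by default, hidden escape hatch" convention described in the comment above.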
```diff
@@ -692,25 +692,39 @@ static int journal_file_verify_header(JournalFile *f) {
 }

 int journal_file_fstat(JournalFile *f) {
         usec_t now_ut = now(CLOCK_MONOTONIC);

         if(f->cache_fstat.done && now_ut - f->last_stat_usec < 1 * USEC_PER_SEC)
```
space between `if` and `(`
```diff
@@ -697,10 +697,12 @@ int journal_file_fstat(JournalFile *f) {
         assert(f);
         assert(f->fd >= 0);

         if (fstat(f->fd, &f->last_stat) < 0)
                 return -errno;
         if(!f->last_stat_usec || !f->cache_fstat) {
```
space between `if` and `(`, i.e. follow our usual coding style
we generally don't rely on C's downgrade-to-bool feature for numerical fields, like `last_stat_usec`. Please do an explicit comparison for that, i.e. use `f->last_stat_usec == 0 || …`
btw, we do use inotify already for detecting changes. the fstat cache should be invalidated whenever we see an inotify change.
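A rough sketch of that invalidation, assuming a non-blocking inotify fd watching the journal directory (systemd's actual inotify plumbing is different; this only illustrates the idea):

```c
/* Sketch: drop the cached fstat() result whenever inotify reports activity
 * on the watched directory. Illustrative only, not systemd's real code. */
#include <assert.h>
#include <stdbool.h>
#include <stdio.h>
#include <stdlib.h>
#include <sys/inotify.h>
#include <unistd.h>

static bool fstat_cache_valid = false;

/* Drain pending events from a non-blocking inotify fd. Any event means
 * something changed under the watch, so the cached fstat() is stale. */
static void maybe_invalidate_fstat_cache(int inotify_fd) {
        char buf[4096];

        while (read(inotify_fd, buf, sizeof(buf)) > 0)
                fstat_cache_valid = false;
}
```

The cache is then refreshed lazily by the next `journal_file_fstat()` call, instead of re-checking the file on a timer.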
@poettering I have added 2
The key problem here is "accuracy vs speed". The way the code currently is in main, it is 100% accurate, meaning that even if things change during a query, the changes will be taken into account mid-query. With the caching I added, these timestamps are assumed constant for the lifetime of the query. When a user is executing a query to get some data, it is obvious that they want fast results, and even if things change mid-query, it is OK to present the output as it was at the time the query started. But when a user gets a

Of course, since most users just want to do a fast query, we could flip the logic: do fast queries by default and provide an option for increased accuracy for the other use cases. But this changes the semantics of the current behavior. A better approach would be to cache by default, but hook cache flushing to run when a reset/restart is requested. Or even introduce a call for the callers to flush the caches when they believe they should. Generally, I think we should carefully consider changing the default...
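The "cache by default, flush on demand" approach described here can be sketched like this. Names are illustrative, and the current time is passed in explicitly so the logic is easy to follow:

```c
/* Minimal sketch of a 1-second fstat() cache with an explicit flush hook,
 * as discussed above. FstatCache and the function names are illustrative,
 * not the PR's exact code. */
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

#define USEC_PER_SEC 1000000ULL
typedef uint64_t usec_t;

typedef struct FstatCache {
        bool done;          /* do we hold a cached fstat() result? */
        usec_t last_usec;   /* CLOCK_MONOTONIC time of that fstat() */
} FstatCache;

/* Returns true if the caller must re-run the real fstat(). */
static bool fstat_cache_needs_refresh(FstatCache *c, usec_t now_usec) {
        if (c->done && now_usec - c->last_usec < 1 * USEC_PER_SEC)
                return false;   /* cached value is still fresh */

        c->done = true;
        c->last_usec = now_usec;
        return true;            /* caller performs the real fstat() */
}

/* Explicit invalidation, e.g. when a query is reset/restarted. */
static void fstat_cache_flush(FstatCache *c) {
        c->done = false;
}
```

Callers that need full accuracy (e.g. on restart) call the flush hook; everyone else gets the fast path by default.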
Created 2 new PRs with everything included here. Closing this in favor of the others.
I noticed that `journalctl` is quite slow when there are several million lines available to be queried. To emulate some load I used this on my development machine:

This gives a little more than 4.7 million log entries (produces output of 765 MiB), and the current main branch needs a little more than 123 seconds to complete.

I compiled it with this:

and ran it through a profiler, to get this:
Note the big yellow lines of `fstatat`. So, it seems that although my system has just a bunch of files, `journalctl` repeatedly runs `fstat` on them.

click to see my `/var/log/journal`
# tree /var/log/journal /var/log/journal ├── 929c259028424a2da82cc8cd09e0bca1 │ ├── system@000604b194adb5b4-432725c4395f2ced.journal~ │ ├── system@000604b29619ad8f-41625a1016b4fa3f.journal~ │ ├── system@000605c820ed5e4e-0e2195376e8833a6.journal~ │ ├── system@000605c89e3ac856-97082da1c64c70bd.journal~ │ ├── system@000938c3307c48fe805e8f1503db53cd-00000000017f07e1-000605c9ec86601e.journal │ ├── system@000938c3307c48fe805e8f1503db53cd-00000000017f0e44-000605c9ecbdd4a2.journal │ ├── system@3a9c2394c05f486fbd9443cdfe1780b5-00000000017f5dde-000605ca7d212d36.journal │ ├── system@63bb8f3469bc4275b6047431486909ea-000000000137bdab-0006041a0bf1092c.journal │ ├── system@63bb8f3469bc4275b6047431486909ea-0000000001385605-0006043229dbdde8.journal │ ├── system@63bb8f3469bc4275b6047431486909ea-0000000001385cab-000604322a0ca39d.journal │ ├── system@63bb8f3469bc4275b6047431486909ea-00000000013a35b6-0006047546c513b6.journal │ ├── system@63bb8f3469bc4275b6047431486909ea-00000000013c117f-000604b194ad7d1e.journal │ ├── system@63bb8f3469bc4275b6047431486909ea-00000000013c3ef4-000604b296197720.journal │ ├── system@63bb8f3469bc4275b6047431486909ea-00000000013ca6da-000604c36d065146.journal │ ├── system@63bb8f3469bc4275b6047431486909ea-00000000013cb281-000604c36dad7ecf.journal │ ├── system@63bb8f3469bc4275b6047431486909ea-00000000013eb995-0006050089b2e51f.journal │ ├── system@63bb8f3469bc4275b6047431486909ea-0000000001407ff2-00060539c8321794.journal │ ├── system@63bb8f3469bc4275b6047431486909ea-00000000014086d0-00060539c86103d4.journal │ ├── system@63bb8f3469bc4275b6047431486909ea-000000000142d399-0006055d75b014c8.journal │ ├── system@63bb8f3469bc4275b6047431486909ea-00000000014558c9-00060567760d0535.journal │ ├── system@63bb8f3469bc4275b6047431486909ea-000000000147f273-00060570073e46ee.journal │ ├── system@63bb8f3469bc4275b6047431486909ea-00000000014a83b9-000605760d2c26ba.journal │ ├── system@63bb8f3469bc4275b6047431486909ea-00000000014d0262-0006057fb0cfed1a.journal │ ├── 
system@63bb8f3469bc4275b6047431486909ea-00000000014f9aed-000605835126e06f.journal │ ├── system@63bb8f3469bc4275b6047431486909ea-0000000001523597-0006058699943758.journal │ ├── system@63bb8f3469bc4275b6047431486909ea-000000000154d0e3-00060589b9c62615.journal │ ├── system@63bb8f3469bc4275b6047431486909ea-0000000001576c3f-0006058cc51a3d60.journal │ ├── system@63bb8f3469bc4275b6047431486909ea-00000000015a07ff-0006058f7589d1cb.journal │ ├── system@63bb8f3469bc4275b6047431486909ea-00000000015ca4fb-00060591f7c690be.journal │ ├── system@63bb8f3469bc4275b6047431486909ea-00000000015f4295-00060594622db439.journal │ ├── system@63bb8f3469bc4275b6047431486909ea-000000000161dfc9-0006059693f390ed.journal │ ├── system@63bb8f3469bc4275b6047431486909ea-0000000001647dbe-00060598a1f55c31.journal │ ├── system@63bb8f3469bc4275b6047431486909ea-0000000001671c1b-0006059aa11305a7.journal │ ├── system@63bb8f3469bc4275b6047431486909ea-000000000169ba56-0006059c92c625e9.journal │ ├── system@63bb8f3469bc4275b6047431486909ea-00000000016c6792-000605a40540f5a0.journal │ ├── system@63bb8f3469bc4275b6047431486909ea-00000000016eeee7-000605ac30c09239.journal │ ├── system@63bb8f3469bc4275b6047431486909ea-000000000171850e-000605b0b80f228d.journal │ ├── system@63bb8f3469bc4275b6047431486909ea-00000000017413d5-000605b6176e33b1.journal │ ├── system@63bb8f3469bc4275b6047431486909ea-0000000001769a94-000605bbc6f9d29b.journal │ ├── system@63bb8f3469bc4275b6047431486909ea-0000000001792d8a-000605c139b941cf.journal │ ├── system@67d8a2003bb847e0bf61434bfc9efc48-00000000017da029-000605c83724ebad.journal │ ├── system@67d8a2003bb847e0bf61434bfc9efc48-00000000017da767-000605c837597e5a.journal │ ├── system@839368cfceef4c58b7134b01deac2654-00000000017e05ac-000605c89601394e.journal │ ├── system@839368cfceef4c58b7134b01deac2654-00000000017e0c96-000605c896322abb.journal │ ├── system@83f233550ffb4274ba3c19e81986198a-00000000017d4e9a-000605c820ed296a.journal │ ├── 
system@b13ced2012b6441bbbbaad080611fdbd-00000000017ede54-000605c98c353cc5.journal │ ├── system@b13ced2012b6441bbbbaad080611fdbd-00000000017ee4a4-000605c98c68acac.journal │ ├── system@ce2abdea1f954e4ebecdaa4f35ec4ad2-00000000017e6827-000605c92539822d.journal │ ├── system@ce2abdea1f954e4ebecdaa4f35ec4ad2-00000000017e6fd8-000605c9256f59f8.journal │ ├── system@db4726d85842415b932cc3811c064b5a-00000000017df0d9-000605c8926b42a1.journal │ ├── system@db4726d85842415b932cc3811c064b5a-00000000017df817-000605c8929b665b.journal │ ├── system@e5806de69bae41668771ad3cb7d27443-00000000017e1f96-000605c89e3a8fab.journal │ ├── system@e5806de69bae41668771ad3cb7d27443-00000000017e26b3-000605c89e69bc93.journal │ ├── system.journal │ ├── user-1001@000604b18099c1eb-0a6766d61185ee96.journal~ │ ├── user-1001@000604b194d0bbb6-381638a635ac22c5.journal~ │ ├── user-1001@000604b28e8f26b5-f8c6939712f9e86d.journal~ │ ├── user-1001@000604b2a0d6ff56-2dd3405342cb0375.journal~ │ ├── user-1001@000605c8211d7d03-75f549522714e133.journal~ │ ├── user-1001@000938c3307c48fe805e8f1503db53cd-00000000017f0e43-000605c9ecba7eb5.journal │ ├── user-1001@67d8a2003bb847e0bf61434bfc9efc48-00000000017da766-000605c8375683b8.journal │ ├── user-1001@839368cfceef4c58b7134b01deac2654-00000000017e0c95-000605c8963039b7.journal │ ├── user-1001@83f233550ffb4274ba3c19e81986198a-00000000017d55e7-000605c8211d7a33.journal │ ├── user-1001@99ce63d7aa024482a2d09efd2ae124bb-000000000137bdaa-0006041a0bed0da9.journal │ ├── user-1001@99ce63d7aa024482a2d09efd2ae124bb-0000000001385caa-000604322a04b6db.journal │ ├── user-1001@99ce63d7aa024482a2d09efd2ae124bb-00000000013c4b40-000604b2a0d6fda8.journal │ ├── user-1001@99ce63d7aa024482a2d09efd2ae124bb-00000000013cb280-000604c36dab247e.journal │ ├── user-1001@99ce63d7aa024482a2d09efd2ae124bb-00000000013eb994-0006050089af26cb.journal │ ├── user-1001@99ce63d7aa024482a2d09efd2ae124bb-00000000014086cf-00060539c8558450.journal │ ├── 
user-1001@99ce63d7aa024482a2d09efd2ae124bb-000000000142d398-0006055d75ac39c2.journal │ ├── user-1001@99ce63d7aa024482a2d09efd2ae124bb-00000000014558c8-0006056776093fbf.journal │ ├── user-1001@99ce63d7aa024482a2d09efd2ae124bb-000000000147f272-00060570073aab1e.journal │ ├── user-1001@99ce63d7aa024482a2d09efd2ae124bb-00000000014a83b8-000605760d2892ca.journal │ ├── user-1001@99ce63d7aa024482a2d09efd2ae124bb-00000000014d0261-0006057fb0cc6d4a.journal │ ├── user-1001@99ce63d7aa024482a2d09efd2ae124bb-00000000014f9aec-000605835123b98a.journal │ ├── user-1001@99ce63d7aa024482a2d09efd2ae124bb-0000000001523596-000605869990e084.journal │ ├── user-1001@99ce63d7aa024482a2d09efd2ae124bb-000000000154d0e2-00060589b9c2a320.journal │ ├── user-1001@99ce63d7aa024482a2d09efd2ae124bb-0000000001576c3e-0006058cc516ff32.journal │ ├── user-1001@99ce63d7aa024482a2d09efd2ae124bb-00000000015a07fe-0006058f758696b5.journal │ ├── user-1001@99ce63d7aa024482a2d09efd2ae124bb-00000000015ca4fa-00060591f7c355e5.journal │ ├── user-1001@99ce63d7aa024482a2d09efd2ae124bb-00000000015f4294-00060594622a5997.journal │ ├── user-1001@99ce63d7aa024482a2d09efd2ae124bb-000000000161dfc8-0006059693f03440.journal │ ├── user-1001@99ce63d7aa024482a2d09efd2ae124bb-0000000001647dbd-00060598a1f1f93e.journal │ ├── user-1001@99ce63d7aa024482a2d09efd2ae124bb-0000000001671c1a-0006059aa10fa18c.journal │ ├── user-1001@99ce63d7aa024482a2d09efd2ae124bb-000000000169ba55-0006059c92c2da9c.journal │ ├── user-1001@99ce63d7aa024482a2d09efd2ae124bb-00000000016c6791-000605a4053c8ef3.journal │ ├── user-1001@99ce63d7aa024482a2d09efd2ae124bb-00000000016eeee6-000605ac30bce604.journal │ ├── user-1001@99ce63d7aa024482a2d09efd2ae124bb-000000000171850d-000605b0b80bf167.journal │ ├── user-1001@99ce63d7aa024482a2d09efd2ae124bb-00000000017413d4-000605b6176aaeb1.journal │ ├── user-1001@99ce63d7aa024482a2d09efd2ae124bb-0000000001769a93-000605bbc6f61ff4.journal │ ├── user-1001@99ce63d7aa024482a2d09efd2ae124bb-0000000001792d89-000605c139b5c24a.journal │ 
├── user-1001@b13ced2012b6441bbbbaad080611fdbd-00000000017ee4a3-000605c98c64ace0.journal │ ├── user-1001@ce2abdea1f954e4ebecdaa4f35ec4ad2-00000000017e6fd7-000605c9256ce2c0.journal │ ├── user-1001@db4726d85842415b932cc3811c064b5a-00000000017df816-000605c892981c8a.journal │ ├── user-1001@e5806de69bae41668771ad3cb7d27443-00000000017e26b2-000605c89e67dd9d.journal │ ├── user-1001.journal │ ├── user-1002@67d8a2003bb847e0bf61434bfc9efc48-00000000017dae06-000605c8396da9fa.journal │ ├── user-1002@83f233550ffb4274ba3c19e81986198a-00000000017d709b-000605c824370233.journal │ ├── user-1002@b13ced2012b6441bbbbaad080611fdbd-00000000017eea65-000605c98d489d57.journal │ └── user-1002@ce2abdea1f954e4ebecdaa4f35ec4ad2-00000000017e7f41-000605c926fa1e58.journal └── remote 3 directories, 100 files
OPTIMIZATION 1
I noticed that `compare_boot_ids()` is responsible for about 40% of execution time. In this PR I added a simple hashtable that maps boot_ids to the newest (machine_id, realtime_usec). This hashtable preallocates 1000 pointers at the `sd_journal` level (8 KiB on 64-bit machines). On each slot a linked list of `struct boot_id_newest_entry` may exist. For the hash I used something trivial, like `(boot_id.qwords[0] | boot_id.qwords[1]) % BOOT_ID_TO_NEWEST_HASHTABLE_SIZE`. Of course, the memory allocated is freed when the `sd_journal` is closed.

OPTIMIZATION 2
The next problem was `journal_file_read_tail_timestamp()`. This function calls `journal_file_fstat()` every time it is called, which results in really a lot of `fstat()` calls.

The solution was simple. Inside `JournalFile` you already have the `last_stat_usec` member. Since you already do a `now(CLOCK_MONOTONIC)` for every invocation, I used 2 new members to cache the last `fstat()` results for up to 1 second. If 1 second passes since the last time the file was `fstat()`ed, the call is made again.

IMPROVEMENT
From 123 seconds, the time is now reduced to 29 seconds. More than 4 times faster. I would say a very good improvement for just a little memory.
Of course, I also checked the output of both runs, and they are, line by line, the same.
Note that `__clock_gettime` still gets about 2 of the 29 seconds. This is really a lot of clock retrieval requests. This call is already in the current main branch, but hidden because `fstat()` takes the premium.

I guess a better solution (to avoid these clock calls too) would be to use `inotify` to receive updates about the directories you are interested in, and avoid doing all these `fstat()` calls all the time.

The biggest changes in this PR are:
- `real_time_usec` per boot id is cached for the lifetime of `sd_journal`.
- `fstat()` is called at most once per second.

I have not stepped too much into your code, so feel free to close this and implement it your own way.
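The per-boot-id cache from the first point above could be sketched roughly like this. The names, the slot count, and the trivial hash follow the PR description, but the exact code is illustrative:

```c
/* Rough sketch of OPTIMIZATION 1: a fixed-size hashtable mapping a 128-bit
 * boot id to the newest realtime timestamp seen for it, with chaining on
 * collisions. Illustrative, not the PR's exact code. */
#include <assert.h>
#include <stdint.h>
#include <stdlib.h>

#define BOOT_ID_TO_NEWEST_HASHTABLE_SIZE 1000

typedef struct BootIdNewestEntry {
        uint64_t boot_id[2];               /* 128-bit boot id as two qwords */
        uint64_t newest_realtime_usec;     /* newest timestamp seen for it */
        struct BootIdNewestEntry *next;    /* linked list per slot */
} BootIdNewestEntry;

typedef struct {
        BootIdNewestEntry *slots[BOOT_ID_TO_NEWEST_HASHTABLE_SIZE];
} BootIdCache;

static size_t boot_id_hash(const uint64_t id[2]) {
        /* trivial hash, as in the PR description */
        return (id[0] | id[1]) % BOOT_ID_TO_NEWEST_HASHTABLE_SIZE;
}

/* Look up a boot id; insert it with the given timestamp if missing. */
static BootIdNewestEntry *boot_id_cache_get(BootIdCache *c, const uint64_t id[2],
                                            uint64_t realtime_usec) {
        size_t h = boot_id_hash(id);

        for (BootIdNewestEntry *e = c->slots[h]; e; e = e->next)
                if (e->boot_id[0] == id[0] && e->boot_id[1] == id[1])
                        return e;          /* cache hit: no re-computation */

        BootIdNewestEntry *e = calloc(1, sizeof(*e));
        if (!e)
                return NULL;
        e->boot_id[0] = id[0];
        e->boot_id[1] = id[1];
        e->newest_realtime_usec = realtime_usec;
        e->next = c->slots[h];
        c->slots[h] = e;
        return e;
}

/* Freed when the sd_journal is closed, per the description above. */
static void boot_id_cache_free(BootIdCache *c) {
        for (size_t i = 0; i < BOOT_ID_TO_NEWEST_HASHTABLE_SIZE; i++)
                while (c->slots[i]) {
                        BootIdNewestEntry *n = c->slots[i]->next;
                        free(c->slots[i]);
                        c->slots[i] = n;
                }
}
```

After the first lookup, every further `compare_boot_ids()`-style comparison for the same boot id is a hash lookup instead of a journal traversal.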
To my understanding, 29 seconds for 4.7 million entries is still a lot. Yes, I know the number of operations is tremendous. At Netdata we implemented on-the-fly indexing for systemd-journal logs, using the public APIs, to produce something like this (note that it has counters for every label value, so it traverses all available fields):
In the final graph, I see that `sd_journal_enumerate_data()` is called twice: once by `get_display_timestamp()` and again by `output_short()`, for about 6 seconds each (actually, finding the timestamp is more expensive than finding the MESSAGE). Merging the 2 calls may require some refactoring of the public APIs, but I think it probably justifies the effort if the time can be further reduced by 20-30%.
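The merge suggested above — one pass over an entry's fields filling in both the display timestamp and the message — could look roughly like this. The entry is modeled as an array of `FIELD=value` strings standing in for repeated `sd_journal_enumerate_data()` calls, so the sketch is self-contained:

```c
/* Sketch of a single-pass field scan: one walk over all fields of an entry
 * picks out every field of interest, instead of one walk per field. The
 * field names are real journal fields; the scanning code is illustrative. */
#include <assert.h>
#include <stddef.h>
#include <string.h>

typedef struct {
        const char *source_realtime;   /* value of _SOURCE_REALTIME_TIMESTAMP= */
        const char *message;           /* value of MESSAGE= */
} EntryFields;

/* Return the value part if `field` starts with `prefix`, else NULL. */
static const char *field_value(const char *field, const char *prefix) {
        size_t l = strlen(prefix);
        return strncmp(field, prefix, l) == 0 ? field + l : NULL;
}

/* Single traversal collecting all interesting fields at once. */
static void scan_entry_once(const char *const *fields, size_t n, EntryFields *out) {
        out->source_realtime = out->message = NULL;

        for (size_t i = 0; i < n; i++) {
                const char *v;

                if ((v = field_value(fields[i], "MESSAGE=")))
                        out->message = v;
                else if ((v = field_value(fields[i], "_SOURCE_REALTIME_TIMESTAMP=")))
                        out->source_realtime = v;
        }
}
```

With this shape, `get_display_timestamp()` and `output_short()` would share one enumeration of the entry instead of enumerating it twice, which is where the estimated 20-30% would come from.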