Showing status of service via systemctl is slow (>10s) if disk journal is used #2460
Comments
is this on hdd or ssd? But yeah, we scale badly if we have too many individual files to combine. It's O(n) with each file we get... |
This one was on HDD, but now that I've looked into it, it can be a bit slow (~0.5s; HDD swap, 300 MB of logs on tmpfs) even with tmpfs, if the part that was loaded happened to be swapped out (I was testing on a system with 2 weeks of uptime). Shouldn't there be some kind of index on journal files? Or at the very least a pointer in the service entry to the last log file that has relevant log entries. |
I think some kind of journal indexing is required, because it's unbearably slow. Right now I have 5411 journal files (43 GiB) and:

```
$ time -p journalctl -b --no-pager > /dev/null
real 13.61
user 13.37
sys 0.22
```

It takes 13 seconds just to check the current boot's log while it's already cached in RAM. When it's not cached it is even slower. This is on 2x 3TB HDD with RAID1 btrfs. |
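For anyone trying to reproduce the cold-cache case above, the usual approach is to flush the page cache before timing the query (a minimal sketch; the drop_caches write needs root):

```
# Flush dirty pages, then drop the page cache so journal files must be re-read from disk
sync
echo 3 | sudo tee /proc/sys/vm/drop_caches > /dev/null

# Time the same query with a cold cache
time -p journalctl -b --no-pager > /dev/null
```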
It is laggy even when it is on tmpfs and the machine runs long enough to swap it out. Why doesn't journald just use SQLite for storage? It would be faster, and other apps could actually use the log files for something useful and have a good query language, instead of relying on a bunch of journalctl options. |
It is still slow as hell, and it opens over a hundred files (on a system that was up for 2 hours). |
@XANi @davispuh is there any chance you could run your slow cases under callgrind and attach the output? |
With systemd 233.75-3 on Arch Linux: callgrind.out.systemctl-status-sshd.gz |
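For reference, a profile like the attached one can be produced roughly as follows (the exact invocation isn't shown in the thread; the service name is just an example):

```
# Run the status query under callgrind; valgrind slows it down a lot,
# so only relative costs are meaningful.
valgrind --tool=callgrind \
         --callgrind-out-file=callgrind.out.systemctl-status-sshd \
         systemctl status sshd

# Summarize the hottest functions afterwards
callgrind_annotate callgrind.out.systemctl-status-sshd | head -n 40
```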
@davispuh Thank you for quickly providing the profiles. It's not a panacea, but #6307 may improve the runtime here. |
Emm, I can't test this anymore since after I compiled and reinstalled systemd it reset my |
With systemd just compiled from fdb6343: when the files aren't cached it's really unusably slow; the 2nd time, when they're cached, it's quick. callgrind.out.systemctl-status-sshd_nocache.gz I now have 7542 journal files. Basically, to improve performance we need to do less disk reading, e.g. use some kind of indexing or something like that. |
I have 240 system log files and 860 user log files. `systemctl status` or `journalctl -f` take 2-4 minutes just to display logs (HDD drive). I have added a drop-in at /usr/lib/systemd/journald.conf.d/90-custom.conf. Systemd generates 2 to 3 system journal files every day, each about 150994944 bytes (144 MB) in size. Why doesn't journalctl -f (or systemctl) check only the latest/current journal? How do I make it efficient and fast? I need to preserve logs for a long duration, but in most cases people only have to check recent logs. Maybe there could be a feature to automatically archive logs into a different directory (/var/log/journal/ID-DIRECTORY/archive) and keep only current logs (say the past 3-7 days) in /var/log/journal/ID-DIRECTORY? This would speed up journalctl and systemctl status a lot. Anyone who wants to check archived logs can use the --directory option of journalctl. |
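The actual contents of that drop-in aren't included above; a retention-oriented drop-in of the kind being described might look like this (all keys are standard journald.conf options, the values are purely illustrative):

```
# /usr/lib/systemd/journald.conf.d/90-custom.conf (illustrative values)
[Journal]
Storage=persistent
SystemMaxUse=20G          # cap total disk usage of /var/log/journal
SystemMaxFileSize=128M    # rotate individual journal files at this size
MaxRetentionSec=1year     # drop entries older than this
```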
I have the same problem; I'm on a VMware VM on an HDD SAN. Journal size is 101.1 GB right now. |
journalctl has a --file parameter, and I am able to use it to search faster. Similarly, can we make systemctl status (and journalctl) check only the most recent journal file(s) by default? This would drastically speed things up. If an admin wants older status output, he can supply an option for it. PS: I have no idea how data is stored in the journal. @poettering do you want me to create an RFE for this? |
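A sketch of the --file workaround being described, restricting the query to the most recently written journal file (paths and unit name are illustrative, and this only gives correct results if the unit actually logged into that file recently):

```
# Most recently modified system journal file
latest=$(ls -t /var/log/journal/*/system*.journal | head -n 1)

# Query only that file instead of letting journalctl open the whole directory
journalctl --file "$latest" -u nginx.service -n 10 --no-pager
```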
Yes, this is a problem for servers storing their logs. My biggest problem is the centralized log server: I receive logs from network equipment using rsyslog, and it uses omjournal to pipe them directly into the journal. It works fine to begin with, but then degrades quickly (note I'm doing this on a test server; we have another server where rsyslog writes to files). Maybe journal files could be made to contain specific timespans, and only get loaded when requested by options like --since/--until, which I use a lot. |
@amishxda IMO it should only fetch more log lines from journalctl if explicitly requested by the user. @Gunni I don't think using systemd as a centralized log server is intended, or a good idea in the first place. The ELK stack is much more useful for it, jankiness aside; Logstash lets you do a lot of nice stuff, for example we use it to split iptables logs into fields (srcip/dstip/port etc.) before putting them in ES. And ES just works better for search. |
@XANi I just expected it to be a supported use case. About the tools you mentioned: setting up all that stuff sounds like much more work, especially since we like to be able to watch the logs live. |
@Gunni I wish journald would just use SQLite instead of its current half-assed binary db; I feel like it is currently just trying to reimplement that, but badly, and there are plenty of tools for querying SQLite already. The ELK stack is definitely more effort to set up, but in exchange it has a ton of nice features: we for example made Logstash do a geoIP lookup on any IP that's not private, so each firewall log gets that info added. Querying is also very nice, as you can run queries on fields directly instead of on text strings. |
I found out about this issue via https://www.reddit.com/r/linuxadmin/comments/gdfi4t/how_do_you_look_at_journald_logs/ . Is this still a problem? |
@otisg as of systemd 241 (that's just the version I have on the machine that actually keeps logs on disk), it is most definitely still a problem (all timing tests done right after dropping caches, ~800 MB in the journal). Now the fun part:
Yes, you are reading this right: getting the last few lines of a currently running service opens every single entry in the journal directory. Now if I instead just brute-force grep the last logrotate's worth of text logs, it is an order of magnitude faster. That this manages to be so much slower than plain grep is mind-boggling. The sheer fact that the developers decided to go with a binary format, yet didn't bother to introduce even time- or service-based sharding/indexing and just brute-force every existing file, is insane. It is like someone, instead of considering reasonable options like per-service files or time-based sharding, decided one evening "you know, I always wanted to make a binary logging format", then got bored after a few weeks and never touched it again. |
Ouch. I had not realized things were so slow. I'm amazed that so many people at https://www.reddit.com/r/linuxadmin/comments/gdfi4t/how_do_you_look_at_journald_logs/ said they consume journal logs via journalctl. Are they all OK with the slowness?!? Why don't more people get their logs out of the journal and centralize them in an on-prem or SaaS service that is faster? Anyhow, I see some systemd developers here; I wonder if they plan on integrating something like https://github.com/tantivy-search/tantivy ... |
The first invocation is slow; it probably goes through the journal files and checks them. Then things are fast for a while. journald is not great for log management, and it's simply not fit for any kind of centralized log management at scale; it's very likely not a goal of the systemd project to handle that too. The journal is a necessity, just like pid1, udev, and network setup (for remote filesystems), for managing a Linux system and its services reliably. That said, it's entirely likely that with a few quick and dirty optimizations this could be worked around (e.g. if journal files are not in cache, don't wait for them when showing status; allow streaming the journal without looking up the last 10 lines; persist some structures to speed up journal operations; enable unverified journal reads by default, etc.). |
@vcaputo Correct me if I'm wrong, but that patch only searches within the current boot ID, so if the system has been running for a long time it wouldn't change anything? I encountered the problem on server machines in the first place (and on my personal NAS), so in almost every case the current boot ID is the only one in the logs. It would certainly help on desktops, but that's not where I hit the problem (also, AFAIK most desktop distros don't have /var/log/journal by default, so journald doesn't log to HDD in the first place). The other problem is that the current implementation makes it really easy for one service to swamp the logs to the point where you lose any other service's logs. It is especially apparent for services that don't emit many logs: even though there are zero actual log lines left for the service (they got rotated out), it still takes about the same amount of time as for any other service. For reference, I see this happening on a machine with just 1 GB of journal and only the last few days of logs in it. |
@XANi Yes you're right, if all the journals are from the same boot the early exit never occurs. FTR it already matches the current boot id, the patch just moves the order around. The change assumes there are probably multiple boots represented in the archived journals. |
It should just write to SQLite files, not to this godawful hacked-together format. Also, the ability to just query the log file via SQL would have been divine. |
That code didn't even exist when this issue was created/discussed; it was introduced in 2023 by 34af749. So at best #30209 undoes a later perf regression introduced by adding that stat storm in 2023; it's not causal for what's going on here. |
For perspective: that's on quite recent systemd-254.7-1.fc39.x86_64, on nvme-cached btrfs raid-1 on 2 spinning Seagate IronWolf Pro drives. |
I wish it was only 20s. And this is after running it once before; the first time it took over 8 minutes. |
@vcaputo v254 is as slow as it was when I first reported this bug. Back then it took 12.5s to get the status out of 4 GB of journals; now it takes 5s to do the same on 1 GB (I have since limited the journal size so using it isn't quite so awful). It's impossible for it to be fast if it doesn't keep track of where a given app's last log entry lives, or have proper indexing (and at that point, just use SQLite). |
It's not like the journal lacks any acceleration structures, IIRC the main expected source of poor scaling is when there's a large number of journals participating in the query since that scales O(N) where N is the number of journals. Per-journal the entry arrays are binary-searched and grow exponentially, so for any given journal it should be a relatively quick sequence iterating the start to the last entry array, no? Perhaps I'm misremembering the details here. I feel like there might be some value in doing some instrumenting and verifying the amount of journal data accessed is at least consistent with what's expected for the existing file format. It's possible to accidentally have extraneous work performed while still producing correct results with this type of thing. |
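One way to sanity-check the O(N)-in-journal-files behaviour described here is to count how many files participate in a query and look at the per-file structures journalctl already exposes (a rough sketch; the paths are the defaults):

```
# N in the O(N) above: how many journal files a query has to consider
ls /var/log/journal/*/*.journal | wc -l

# Per-file internals (entry/data object counts, arena size, etc.)
journalctl --header --file /var/log/journal/*/system.journal
```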
The edge case is where the journal doesn't have any logs for a given service. Imagine a server where some services write logs regularly, while others might not output anything for days or weeks (say nginx, which only emits journal entries on start and stop, and otherwise writes its own log files elsewhere). Once the "silent" service gets rotated out of the journal files, it takes a long time to get even the last few lines it emitted. I also have no idea why plain grepping the same journal files is faster... |
Except that doesn't really make sense, based on my recollection of how this should work. If that's what's happening in that edge case, I think it's a bug in the implementation. The match on _BOOT_ID is ANDed with all the other conditions. Your edge case would be where all the other conditions fail to advance the location, and your description makes it sound like everything matching the _BOOT_ID is then getting visited anyway. My recollection is that this should just stop the matching from progressing further at all, short-circuiting without visiting all the entries for the _BOOT_ID. It could just be a silly bug to go fix. |
It does look like it is. But even if it were not, it would mean that just one spammy app makes any status/log request for any other app slow. At the very least, each service should store a pointer to the last file that holds its logs, so the search could be short-circuited instead of sitting through gigabytes of logs just to display its status. Or hell, just slap the name of that file next to the rest of the service's state. |
As you wish (systemd from Debian 12, 242.6): lo and behold, every single file is visited. Here is the full log: log.txt
But let's test more: let's restart nginx so there are some fresh logs in the journal. But hey, maybe that's not enough new lines and it has to read the whole journal to return more than just "Starting nginx"? Still, it scans every file even when recent logs are available, despite the command needing only the last few lines. |
Visiting every file is expected; they all participate in the matching process, that's by design. What's unexpected is visiting every entry in every journal. It shouldn't have to read the entirety of every journal file to answer a status query. I feel like we're talking past each other and the issue is going in circles, but I don't have time right now to dig into the code and run this to ground. My efforts have historically focused mostly on the journald/writer side of things, not the consumer/reader side. |
I've stopped all the services so they don't skew the I/O readings for the test, and: out of 1.3 GB of journal it read ~450 MB. So yeah, while it isn't visiting every single byte of every file, it's still terrible. The way it is structured is pretty much unfixable if the goal is to find the logs of a single service efficiently. |
Well, that's still a lot different from reading all of the entries, and it indicates to me that no, we don't have a silly bug here to go fix. An important implementation detail to keep in mind is the journal file's use of mmap, relying heavily on the kernel's async read-ahead/around to achieve some semblance of decent performance. The kernel doesn't have any knowledge of the journal file layout/access patterns, and the layout isn't optimized to pack as much query-relevant data contiguously for any given kernel read-ahead/around, so we pull in a lot of "chaff" in those async reads triggered by the "wheat" accesses. If you tune the kernel to use smaller read-ahead/around sizes, you'll find the performance gets even worse despite reading much less data from disk. This is because the userspace code accessing the data via mmap becomes blocked more often waiting on the granular faults, in a "death by a thousand papercuts" fashion, by becoming more synchronous with the kernel's fault handling. |
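A quick way to observe the read-ahead effect described above is to shrink the block device's read-ahead window and re-time the same query (a sketch; /dev/sda stands in for whatever device backs /var/log/journal):

```
# Current read-ahead of the backing device, in 512-byte sectors
blockdev --getra /dev/sda

# Temporarily shrink read-ahead, drop caches, and re-time the query:
# less data gets pulled in per fault, but the reader blocks far more often.
sudo blockdev --setra 32 /dev/sda
sync && echo 3 | sudo tee /proc/sys/vm/drop_caches > /dev/null
time systemctl status nginx.service > /dev/null
```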
Yes, it is not a "silly bug"; it's a fundamental fault of the log format and the way it is written and accessed. Just having a log file per service would cut that by orders of magnitude. Hell, even hashing log file names ("service x goes into hash(servicename).[0..3]") would cut it by orders of magnitude. Checking the status of a single service is the most common way of checking the logs, and it somehow happens to also be the worst-case scenario for this dreary design. |
As another point of comparison: I need to keep a lot of logs (tens of GB), and I regularly need to extract logs within a 24-48 hour span from about 2-4 weeks ago (for a single service). I had to set up syslog-ng, whose functionality I didn't really need, because extracting logs from journald in the scenario described above can take almost half an hour. This setup had created a lot of journal files, but they were on a fast NVMe SSD. So in syslog-ng I set up a MySQL backend plus a text log per service, with logrotate as backup. I haven't even had to use the MySQL log yet, because searching the multiple rotated files of a single service's log (to find the start date of the range I need to extract) with ripgrep takes several seconds (!) instead of tens of minutes. Given that logrotate rotates logs every week and I keep them for at least a year, this is also a lot of files that the kernel has no special knowledge of for caching, and ripgrep doesn't even know that the lines are sorted by date, yet it still performs a whole lot better. |
I did some playing around and the results have been quite interesting. The data journalctl returns is 6.5 MB worth of text, 44k lines (the journal was freshly cleaned out before the test, but the ratios were the same as when it had grown to 1 GB). Now, I assume binary log entries carry a bit more metadata, but I've seen anywhere between ~850 bytes and >1 KB per log entry once it reached the 1 GB allowed for /var/log/journal, which seems excessive. So yeah, no wonder it is slow if each entry takes so much more space than the actual information in it, on top of there being no way to look entries up. For comparison I just dumped the same data into an SQLite database (there are more log entries in it because it is fed from rsyslog, which does some remote logging for me). |
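The thread doesn't show how that SQLite dump was produced; a minimal sketch of this kind of comparison could look like the following (the schema, file names, and jq field handling are illustrative, not what the commenter actually ran):

```
# Export a few journal fields as CSV and load them into SQLite for comparison.
journalctl -o json --no-pager | \
  jq -r '[.__REALTIME_TIMESTAMP, (._SYSTEMD_UNIT // "-"), (.MESSAGE | tostring)] | @csv' \
  > journal.csv

sqlite3 journal.db <<'SQL'
CREATE TABLE IF NOT EXISTS logs(ts INTEGER, unit TEXT, message TEXT);
CREATE INDEX IF NOT EXISTS logs_unit_ts ON logs(unit, ts);
.mode csv
.import journal.csv logs
-- Last 10 lines for one unit, answered from the (unit, ts) index:
SELECT datetime(ts/1000000, 'unixepoch'), message
  FROM logs WHERE unit = 'nginx.service'
  ORDER BY ts DESC LIMIT 10;
SQL
```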
More log fun: whenever I try to query a service's status I get a pile of warnings about there being too many journal files. It took forever to even tell me "No entries", and I know it's lying, there are for sure some entries... |
same issue |
It also told you why it couldn't produce meaningful output, but you chose to ignore that part. If you have a custom journald config that produces that many files, you should rethink it. |
@dtardon why can't it just shard by app name? The current approach is excessively wasteful in every respect. It's the equivalent of scanning the entire database any time you need a record by one of the most commonly used columns... |
I don't understand what you mean by this... To get the last few records, it's not necessary to read all journal files in full, but all journal files do have to be opened to figure out where those last few records are. Which can be a problem if there are too many of them, like in your case. (In fact, you have even more files than the limit allows.) |
Write to a journal file named after (a prefix of) the hash of the service name. So if the service name hashes to da39a3ee5e6b4b0d3255bfef95601890afd80709, the file name would be something like da.journal. That's just an ugly workaround, but really, anything is better than the current garbage implementation. Hell, just save the name of the last file a service wrote a log entry into, together with the rest of the service's info.
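A tiny sketch of that sharding idea (illustrative only, not systemd behaviour; the two-character prefix corresponds to the 256-bucket figure brought up in the reply below):

```
# Derive a shard name from the hashed unit name
unit="nginx.service"
shard=$(printf '%s' "$unit" | sha1sum | cut -c1-2)   # 2 hex chars => 256 buckets
echo "system-${shard}.journal"                       # a status query would open only this shard
```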
In the case where the last log entry is from a few days ago (or has been rotated out), that still ends up with the majority of the logs being read. For my 1 GB of logs, asking for an app whose logs were rotated out causes around 440 MB of reads for a simple `systemctl status`.
Wrong person, I'm not the guy hitting the 7k file limit. I have only 1 GB of journal and it still takes multiple seconds on my NAS, because it is a NAS, so all free RAM gets used for caching other stuff; the moment I need to use any of the systemd commands that touch logs, none of the older logs are in the disk cache and it takes forever. |
It would also produce 256 times more files, increasing disk usage (active journal files have a fixed minimal size, 4 MB IIRC) and making hitting the file limit more probable. |
There are more use cases for the journal than showing a few lines of log in `systemctl status`. |
All files have to be opened only in the absolute worst case. In the most common scenario it should be enough to check the most recent journal file and find that it has enough entries. And the most recent file can be found just by ordering them by modification time, which needs a very small amount of I/O. Reading from all journal files for each status report is extremely inefficient. |
Change the mtime of an old archived journal file, and suddenly the oldest archived journal file is the most recent... And that's just an artificial counter-example. There are completely legitimate time shifts (e.g., DST changes) that may cause the active journal file not to be the most recent one. |
Well, there should be ways to optimize this. At the very least journalctl can safely assume that the current active journal file is newer than all archived files. Thus, if the active journal file contains enough records to populate the status output, the archived files should not need to be accessed. |
When reading logs I have the expectation of seeing them in the order they were produced. If there are time jumps in that period, I am OK with it; I imagine old-school admins who are used to tailing /var/log/messages would expect the same thing. There is also a reasonable expectation that the system time is monotonic. I cannot think of a "legitimate" time shift. DST is certainly not it, because the timestamps (mtimes) are recorded as Unix times, which are not affected by DST. E.g., here is an example of DST moving back while Unix time keeps moving forward:
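(An illustrative pair of timestamps around the 2023-11-05 US DST change; the concrete values are only an example.)

```
$ TZ=America/New_York date -d @1699163940 '+%F %T %Z  (epoch 1699163940)'
2023-11-05 01:59:00 EDT  (epoch 1699163940)
$ TZ=America/New_York date -d @1699164060 '+%F %T %Z  (epoch 1699164060)'
2023-11-05 01:01:00 EST  (epoch 1699164060)
```

The wall clock goes backwards (01:59 to 01:01) while the epoch value keeps increasing.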
|
With a big (4 GB, a few months of logs) on-disk journal, `systemctl status service` becomes very slow. It is of course faster once it gets into the cache... for that service; querying another one is still slow. Dunno what the right way to do it would be, but opening ~80 log files just to display a service's status seems a bit excessive.