Showing status of service via systemctl is slow (>10s) if disk journal is used #2460

Open · XANi opened this issue Jan 28, 2016 · 67 comments

XANi commented Jan 28, 2016

With a big on-disk journal (4 GB, a few months of logs), systemctl status <service> becomes very slow:

 (!) [13:37:30]:/var/log/journal☠ time  systemctl status nginx
* nginx.service - A high performance web server and a reverse proxy server
   Loaded: loaded (/lib/systemd/system/nginx.service; enabled; vendor preset: enabled)
   Active: active (running) since Thu 2016-01-28 08:25:28 CET; 5h 12min ago
 Main PID: 3414 (nginx)
   CGroup: /system.slice/nginx.service
           |-3414 nginx: master process /usr/sbin/nginx -g daemon on; master_process on;
           |-3415 nginx: worker process
           |-3416 nginx: worker process
           |-3417 nginx: worker process
           `-3418 nginx: worker process

Jan 28 08:25:12 ghroth systemd[1]: Starting A high performance web server and a reverse proxy server...
Jan 28 08:25:28 ghroth systemd[1]: Started A high performance web server and a reverse proxy server.

real    0m12.505s
user    0m0.016s
sys 0m0.056s
 (!) [13:35:14]:/var/log/journal☠ du -h --max=1
4,2G    ./46feef66e59512fcd99c7ddc00000108
4,2G    .
 (!) [13:40:53]:/var/log/journal☠ strace systemctl status nginx 2>&1 |grep open |grep /var/log |wc -l
88

It is of course faster once it gets into the cache... but only for that service; querying another one is still slow.

I don't know what the right way to do it would be, but opening ~80 log files just to display a service's status seems a bit excessive.

@poettering (Member)

Is this on HDD or SSD?

But yeah, we scale badly if we have too many individual files to combine. It's O(n) in the number of files...

XANi (Author) commented Jan 28, 2016

This one was on HDD, but now that I've looked into it, it can be a bit slow (~0.5 s, with swap on HDD and 300 MB of logs on tmpfs) even with tmpfs, if the part that was loaded happened to be swapped out (I was testing on a system with 2 weeks of uptime).

Shouldn't there be some kind of index on journal files? Or at the very least a pointer in the service entry to the last log file that has relevant entries.

davispuh commented Oct 1, 2016

I think some kind of journal indexing is required because it's unbearably slow.

Right now I have 5411 journal files (43 GiB), and

$ time -p journalctl -b --no-pager > /dev/null
real 13.61
user 13.37
sys 0.22

it takes 13 seconds just to read the current boot's log, even though it's already cached in RAM.

When it's not cached

# echo 3 > /proc/sys/vm/drop_caches
$ time -p journalctl -b --no-pager > /dev/null
real 69.29
user 13.51
sys 12.34

This is on 2x 3 TB HDDs with btrfs RAID1.

XANi (Author) commented Oct 1, 2016

It is laggy even when the journal is on tmpfs and the machine runs long enough for it to get swapped out.

Why doesn't journald just use SQLite for storage? It would be faster, other apps could actually use the log files for something useful, and there would be a proper query language instead of a bunch of journalctl options.

XANi (Author) commented Jun 21, 2017

It is still slow as hell:

time systemctl status kdm|cat
* kdm.service - LSB: X display manager for KDE
   Loaded: loaded (/etc/init.d/kdm; generated; vendor preset: enabled)
   Active: active (exited) since Wed 2017-06-21 12:03:26 CEST; 1h 42min ago
     Docs: man:systemd-sysv-generator(8)
    Tasks: 0 (limit: 4915)
   CGroup: /system.slice/kdm.service

Jun 21 12:03:25 ghroth systemd[1]: Starting LSB: X display manager for KDE...
Jun 21 12:03:26 ghroth systemd[1]: Started LSB: X display manager for KDE.

real	0m2.873s
user	0m0.008s
sys	0m0.016s

and it opens over a hundred files (on a system that has been up for 2 hours):

strace systemctl status kdm 2>&1 |grep open |wc -l
133

vcaputo (Member) commented Jul 7, 2017

@XANi @davispuh is there any chance you could run your slow cases under valgrind --tool=callgrind and supply us with the output? I've seen some pathological cases where journalctl spends an exorbitant amount of CPU time (~50% of its execution) in the siphash24 code when the simple context cache in the mmap-cache experiences high miss rates; I'm curious whether that's also what you're observing here.

davispuh commented Jul 8, 2017

With systemd 233.75-3 on Arch Linux:

callgrind.out.systemctl-status-sshd.gz
callgrind.out.journalctl-b--no-pager.gz

vcaputo (Member) commented Jul 8, 2017

@davispuh Thank you for quickly providing the profiles. The callgrind.out.systemctl-status-sshd.gz profile shows mmap_cache_get() somewhat prominently with the hashmap maintenance being a significant part of that.

It's not a panacea, but #6307 may improve the runtime of systemctl status sshd for you. Would you be up for some testing? Some time systemctl status sshd --no-pager comparisons before and after would be great.

For anybody reading, the kcachegrind utility works quite well for visualizing callgrind.out files.
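
To reproduce that kind of profile, a minimal sketch (assuming valgrind and kcachegrind are installed; callgrind names its output file after the profiled process's PID):

$ valgrind --tool=callgrind systemctl status sshd --no-pager   # writes callgrind.out.<pid>
$ callgrind_annotate callgrind.out.*                           # plain-text summary
$ kcachegrind callgrind.out.*                                  # interactive visualization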

davispuh commented Jul 9, 2017

Hmm, I can't test this anymore: after I compiled and reinstalled systemd, it reset my /etc/systemd/journald.conf settings, systemd deleted all the old journals, and it's fast again. Basically it's slow only when you have a lot of journal files in /var/log/journal/.

davispuh commented Oct 5, 2017

With a build just compiled from fdb6343:

When the files aren't cached, it's really unusably slow:

$ time systemctl status sshd --no-pager
0.28user 34.82system 4:34.32elapsed 12%CPU (0avgtext+0avgdata 1355416maxresident)k
3242224inputs+0outputs (10002major+17658minor)pagefaults 0swaps

The second time, when it's cached, it's quick:

$ time systemctl status sshd --no-pager
0.09user 0.33system 0:00.43elapsed 99%CPU (0avgtext+0avgdata 1303540maxresident)k
0inputs+0outputs (0major+23705minor)pagefaults 0swaps

callgrind output when it's not cached and when it is:

callgrind.out.systemctl-status-sshd_nocache.gz
callgrind.out.systemctl-status-sshd_cached.gz

I have 7542 *.journal files in /var/log/journal/<ID>, which is on a btrfs RAID1 partition (2x 3 TB HDD).

Basically, to improve performance it needs to do less disk reading, e.g. by using some kind of indexing.

amishmm commented Mar 31, 2018

I have 240 system log files and 860 user log files.

systemctl status or journalctl -f takes 2-4 minutes just to display logs (on an HDD).

I have added this in /usr/lib/systemd/journald.conf.d/90-custom.conf:

[Journal]
SystemMaxUse=100G
SystemMaxFileSize=200M
SystemMaxFiles=1100
MaxFileSec=1week
MaxRetentionSec=3year

systemd generates 2 to 3 system journal files every day, each about 150994944 bytes (~144 MiB) in size.

Why doesn't journalctl -f (or systemctl) check only the latest/current journal?

How do I make it efficient and fast? I need to preserve logs for a long duration.

In most cases, people only need to check recent logs.

Maybe add a feature to automatically archive logs into a different directory (/var/log/journal/ID-DIRECTORY/archive), keeping current logs (say, the past 3-7 days) in /var/log/journal/ID-DIRECTORY?

This would speed up journalctl and systemctl status a lot. Anyone wanting to check archived logs could use journalctl's --directory option, as sketched below.
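
Hypothetically, with the split layout proposed above (the archive subdirectory is this suggestion's invention; --directory, -u and --since are real journalctl options):

$ journalctl --directory=/var/log/journal/ID-DIRECTORY/archive -u nginx.service --since "2 weeks ago"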

Gunni commented Apr 20, 2018

I have the same problem; I'm on a VMware VM on an HDD SAN.

Example: time systemctl status systemd-journald.service

real    0m50.484s
user    0m0.070s
sys     0m3.492s

Journal size is 101.1 GB right now.

amishmm commented Jun 5, 2018

journalctl has a --file parameter, and I am able to use it to search faster. While journalctl -f took ages, journalctl --file /var/log/journal/*/system.journal -f takes just 1-2 seconds.

Similarly, can we make systemctl status use ONLY system.journal by default? In most cases an administrator checks status only right after systemctl (re)start or when something unexpected happens (which is also likely to be logged in system.journal, unless it was rotated just recently).

This would drastically speed things up.

If the admin wants older status, they can supply --lines=N, in which case systemctl would scan through the older journals too (maybe in reverse order), or they can use journalctl -n N -u service instead.
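
In that spirit, a manual approximation already works today (a sketch; it can only see entries still present in the active system.journal):

$ journalctl --file /var/log/journal/*/system.journal -u nginx.service -n 10 --no-pager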

PS: I have no idea how data is stored in the journal.

@poettering do you want me to create an RFE for this?

PPS: Right now, every time I run systemctl status, it's as good as a DoS attack!

Gunni commented Jun 5, 2018

Yes this is a problem for servers storing their logs.

My biggest problem is the centralized log server: I receive logs from network equipment via rsyslog, which uses omjournal to pipe them directly into the journal. It works fine to begin with but then degrades quickly (note I'm doing this on a test server; we have another server where rsyslog writes to files).

Maybe journal files could be made to contain specific timespans and only get loaded when requested. I use options like --since and --until a lot, but since they are affected by this, it doesn't help; with that approach, journalctl could find the right files in fractions of a second. -f should really be instant: find the last 10 entries, and only search backwards in time if nothing happened, maybe with a cutoff.

XANi (Author) commented Jun 6, 2018

@amishxda IMO status by default should just... not have to touch on-disk logs, ever. Storing the last few log lines per service in memory wouldn't be that memory hungry, wouldn't trash the OS disk cache on every status call, and would return something more useful than the "welp, logs were rotated, too bad" it currently shows once a server has been running for some time.

It should only fetch more log lines from journalctl if explicitly requested by the user.

@Gunni I don't think using systemd as a centralized log server is intended, or a good idea in the first place. An ELK stack is much more useful for that; jankiness aside, Logstash lets you do a lot of nice stuff. For example we use it to split iptables logs into fields (srcip/dstip/port etc.) before putting them in ES, and ES just works better for search.

Gunni commented Jun 6, 2018

@XANi I just expected it to be a supported use case since systemd-journal-remote is a thing, but if this is the expected performance, of course. If it is ever improved, maybe with a better index or a database backend for the journal, then I'd be able to use it.

About the tools you mentioned: setting all that up sounds like much more work, especially since we like to be able to watch the logs live, the second they arrive, with journalctl SRC=net -f (maybe with a grep). My experience with the tools you mentioned has been a long delay from event to user. But I'll look into it.

XANi (Author) commented Jun 7, 2018

@Gunni I wish journald would just use SQLite instead of its current half-assed binary db. I feel like it is currently just trying to reimplement that, but badly. And there are plenty of tools for querying SQLite already.

The ELK stack is definitely more effort to set up, but in exchange it has a ton of nice features. For example, we made Logstash do a GeoIP lookup on any IP that's not private, so each firewall log line gets that info added.

Querying is also very nice, as you can query fields directly instead of text strings.

otisg commented May 4, 2020

I found out about this issue via https://www.reddit.com/r/linuxadmin/comments/gdfi4t/how_do_you_look_at_journald_logs/ . Is this still a problem?

XANi (Author) commented May 4, 2020

@otisg as of systemd 241 (that's just the version I have on a machine that actually keeps logs on disk), it is most definitely still a problem (all timing tests were done right after dropping caches, with ~800 MB in the journal):

$ time systemctl status collectd  2>&1 >/dev/null
real	0m5,921s
user	0m0,004s
sys	0m0,180s

$ strace -f    systemctl status collectd  2>&1 |grep open |grep /var/log/journal |wc -l
721

Now the fun part:

find /var/log/journal |wc -l

Yes, you are reading this right: getting the last few lines of a currently running service opens every single entry in the journal dir.

Now, if I just do:

$ time grep -R collectd /var/log/*log >/dev/null

real	0m0,370s
user	0m0,010s
sys	0m0,009s

This manages to be an order of magnitude slower than just brute-force grepping the last logrotate's worth of text logs, which is mind-boggling. The sheer fact that the developers decided to go with a binary format, yet didn't bother introducing even time- or service-based sharding/indexing, and just brute-force every existing file, is insane.

It is like someone, instead of considering reasonable options like, I dunno:

  • naming log files by a hash of the service name, so it never needs to open more than 1/n of the files
  • naming log files by service name (or service group), so only one file ever needs to be opened for recent entries, and more only for history (you know, like the logrotate + syslog combo ends up working)
  • using an actual well-tested database like SQLite and just creating a few bigger files with an index/TOC

decided one evening "you know, I always wanted to make a binary logging format", then got bored after a few weeks and never touched it again.

otisg commented May 5, 2020

Ouch. I had not realized things were so slow. Now I'm amazed that so many people at https://www.reddit.com/r/linuxadmin/comments/gdfi4t/how_do_you_look_at_journald_logs/ said they consume journal logs via journalctl. Are they all OK with the slowness?!? Why don't more people get their logs out of the journal and centralize them in an on-prem or SaaS service that is faster?

Anyhow, I see some systemd developers here. I wonder if they plan on integrating something like https://github.com/tantivy-search/tantivy ...

@PAStheLoD

The first invocation is slow. It probably goes through the journal files and checks them. Then things are fast for a while.

journald is not great for log management, and it's simply not fit for any kind of centralized log management at scale. But handling that is very likely not a goal of the systemd project anyway. The journal is a necessity, just like pid1, udev, and network setup (for remote filesystems), for managing a Linux system and its services reliably.

That said, it's entirely likely that a few quick-and-dirty optimizations could work around this (e.g. if journal files are not in cache, don't wait for them when showing status; allow streaming the journal without looking up the last 10 lines; persist some structures to speed up journal operations; enable unverified journal reads by default; etc.).

vcaputo (Member) commented Oct 27, 2020

@XANi Are you still suffering from these very slow uncached systemctl status collectd invocations?

If so, curious if you'd be willing to try out a build with #17459 applied and report back the results, thanks.

XANi (Author) commented Oct 27, 2020

@vcaputo Correct me if I'm wrong, but that patch only searches within the current boot ID, so if the system has been running for a long time it wouldn't change anything?

I encountered the problem on server machines in the first place (and on a personal NAS), so in almost every case the current boot ID is the only one in the logs. It would certainly help on a desktop, but that's not where I hit the problem (also, AFAIK most desktop distros don't enable /var/log/journal by default, so journald doesn't log to the HDD in the first place).

The other problem is that the current implementation makes it really easy for one service to swamp the logs to the point where you lose every other service's logs. It is especially apparent for services that don't emit many logs: even when there are zero actual log lines left for a service (they got rotated out), it still takes about the same amount of time as for any other service.

For reference, I have this happening on a machine with just 1 GB of journal, holding only the last few days of logs.

vcaputo (Member) commented Oct 27, 2020

@XANi Yes, you're right: if all the journals are from the same boot, the early exit never occurs.

FTR it already matches the current boot ID; the patch just moves the order around. The change assumes there are probably multiple boots represented in the archived journals.

rnhmjoj (Contributor) commented Jan 16, 2024

The stat problem is being worked on: #30209.
I don't know if that's enough to solve the journal performance issues, though.

XANi (Author) commented Jan 16, 2024

It should just write to SQLite files, not this godawful hacked-together format.

Also, the ability to just query the log file via SQL would have been divine.

vcaputo (Member) commented Jan 16, 2024

The stat problem is being worked on: #30209. I don't know if that's enough to solve the journal performance issues, though.

That code didn't even exist when this issue was created/discussed. It was introduced in 2023 by 34af749.

So at best #30209 undoes a later perf regression introduced by adding that stat storm in 2023; it's not the cause of what's going on here.

zdzichu (Contributor) commented Jan 16, 2024

For perspective:

% time systemctl status --no-pager systemd-networkd
 …
real    0m17,491s
user    0m0,013s
sys     0m1,171s

That's on quite recent systemd-254.7-1.fc39.x86_64, with an NVMe cache and btrfs RAID-1 on two spinning Seagate IronWolf Pros.

davispuh commented Jan 27, 2024

For perspective:

% time systemctl status --no-pager systemd-networkd
 …
real    0m17,491s
user    0m0,013s
sys     0m1,171s

That's on quite recent systemd-254.7-1.fc39.x86_64, with an NVMe cache and btrfs RAID-1 on two spinning Seagate IronWolf Pros.

I wish it were only 20 s. Try using the journal when:

# time systemctl status stuff
[...]
real    4m47,699s
user    0m0,421s
sys     2m24,206s

And this is after running it once before; the first time it took over 8 minutes. And my whole /var is on NVMe. Also, my journal folder is only ~3.5 GiB, so it's actually way slower than back in 2016 when I originally commented; things have only become worse, not better.

Looking at iotop I see journalctl reading 200 MiB/s on average, so it looks like it's reading the same stuff over and over again. What a joke.

Also, journalctl -b --no-pager -u stuff took almost an hour 😂

vcaputo (Member) commented Jan 27, 2024

#30209 isn't merged yet; v254 contains 34af749, which introduced the fstats that hurt performance, so you're likely just experiencing the known regression at this point.

No significant changes have been made to the journal file format prioritizing improved reader/journalctl performance.

XANi (Author) commented Jan 28, 2024

@vcaputo v254 is as slow as it was when I first reported this bug. Back then it took 12.5 s to get the status out of 4 GB of journals; now it takes 5 s to do the same on 1 GB (I have since limited the journal size so that using it isn't so awful).

It's impossible for it to be fast if it doesn't keep track of the last log file a given app wrote to, or have proper indexing (and at that point, just use SQLite).

vcaputo (Member) commented Feb 1, 2024

It's not like the journal lacks acceleration structures. IIRC the main expected source of poor scaling is a large number of journals participating in the query, since that scales O(N) where N is the number of journals. Per journal, the entry arrays are binary-searched and grow exponentially, so for any given journal it should be a relatively quick sequence iterating from the start to the last entry array, no? Perhaps I'm misremembering the details here.

I feel like there might be some value in doing some instrumentation and verifying that the amount of journal data accessed is at least consistent with what's expected for the existing file format. It's possible to accidentally perform extraneous work while still producing correct results with this kind of thing.

XANi (Author) commented Feb 2, 2024

The edge case is where the journal doesn't have any logs for a given service.

Imagine a server where some services write logs regularly, while others might not output anything for days or weeks (say nginx, which only emits journal entries on start and stop, and otherwise writes its own log files elsewhere).

Once the "silent" service's entries get rotated out of the journald files, it takes a long time to get even the last few lines it emitted.

I also have no idea why plain grepping the same journal files is faster...

vcaputo (Member) commented Feb 2, 2024

The edge case is where the journal doesn't have any logs for a given service.

Imagine a server where some services write logs regularly, while others might not output anything for days or weeks (say nginx, which only emits journal entries on start and stop, and otherwise writes its own log files elsewhere).

Once the "silent" service's entries get rotated out of the journald files, it takes a long time to get even the last few lines it emitted.

Except that doesn't really make sense, based on my recollection of how this should work. If that's what's happening in that edge case, I think it's a bug in the implementation.

The match of _BOOT_ID is ANDed with all the other conditions. Your edge case would be where all the other conditions fail to advance the location, and your description makes it sound like everything matching the _BOOT_ID is then getting visited anyway.

My recollection is that this should just stop the matching from progressing further at all, short-circuiting without visiting all the entries for the _BOOT_ID.

It could just be a silly bug to go fix. Try running systemctl status $silent.service with SYSTEMD_LOG_LEVEL=debug and look for the journal filter line; it should make clear what the match conditions are.
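
For example, something like this (nginx.service standing in for the silent service):

$ SYSTEMD_LOG_LEVEL=debug systemctl status nginx.service 2>&1 | grep 'Journal filter'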

XANi (Author) commented Feb 3, 2024

The match of _BOOT_ID is ANDed with all the other conditions. Your edge case would be where all the other conditions fail to advance the location, and your description makes it sound like everything matching the _BOOT_ID is then getting visited anyway.

It does look like it is. But even if it weren't, it would mean that one spammy app makes any status/log request for any other app slow.

At the very least, each service should store a pointer to the last file its logs went to, so the search could be short-circuited instead of sifting through gigabytes of logs just to display its status. Or hell, just slap the filename in with the rest of the service info.

It could just be a silly bug to go fix. Try running systemctl status $silent.service with SYSTEMD_LOG_LEVEL=debug and look for the journal filter line; it should make clear what the match conditions are.

As you wish (systemd from Debian 12, 252.6):

# echo 3 > /proc/sys/vm/drop_caches
# time systemctl -l status nginx 2>&1 |ts >/tmp/log #save it to log with timestamps
real	0m3.819s
user	0m0.045s
sys	0m0.184s

# cat /tmp/log  |grep File |wc -l # get the number of "File ... added." lines
81

# find /var/log/journal -type f |wc -l # get number of files in the journal
81

Lo and behold, every single file is visited. Here is the full log: log.txt

Journal filter:

Feb 03 01:02:17 Journal filter: (_BOOT_ID=a2e5ac42f2ea4e9f8f52080d4be63985 AND ((OBJECT_SYSTEMD_UNIT=nginx.service AND _UID=0) OR (UNIT=nginx.service AND _PID=1) OR (COREDUMP_UNIT=nginx.service AND _UID=0 AND MESSAGE_ID=fc2e22bc6ee647b6b90729ab34a250b1) OR _SYSTEMD_UNIT=nginx.service))

But let's test more. Let's restart nginx so there are some fresh logs in the journal:


# systemctl restart nginx
# echo 3 > /proc/sys/vm/drop_caches
# cat /tmp/log  |grep File  |wc -l
81

But hey, maybe that's not enough new lines, and it has to read the whole journal to return more than just "Starting nginx"?

# for a in {1..2000} ; do systemctl restart nginx ; done # now my entire log is failed nginx restart requests, because systemd didn't like that many requests that fast, which is entirely understandable

# time systemctl -l status  nginx 2>&1 |ts >/tmp/log

real	0m4.006s
user	0m0.053s
sys	0m0.198s
# # it got slower, because the logs are now bigger

# cat /tmp/log  |grep File  |wc -l
81
# find /var/log/journal -type f |wc -l
81

It still scans every file even when recent logs are available, despite the command needing only the last ~10 log lines to display.

vcaputo (Member) commented Feb 3, 2024

Visiting every file is expected; they all participate in the matching process. That's by design.

What's unexpected is visiting every entry in every journal. It shouldn't have to read in the entirety of every journal file to answer a systemctl status $foo.service style query AIUI.

I feel like we're talking past each other and the issue is going in circles, but I don't have time right now to dig into the code and run this to ground. My efforts have historically focused mostly on the journald/writer side of things, not the consumer/reader side.

XANi (Author) commented Feb 3, 2024

What's unexpected is visiting every entry in every journal. It shouldn't have to read in the entirety of every journal file to answer a systemctl status $foo.service style query AIUI.

I've stopped all the services so they don't interfere with the I/O readings of the test, and:

  • it took 3 seconds (shorter than before because there is nothing going on in the system)
  • it ran my hard drives at up to 150 MB/s the entire time

Out of 1.3 GB of journal, it read ~450 MB. So yeah, while it isn't visiting every single byte of every file, it's still terrible.

My efforts have historically focused mostly on the journald/writer side of things, not the consumer/reader side.

Well, the way it is structured makes it pretty much impossible to efficiently find the logs of a single service.

vcaputo (Member) commented Feb 9, 2024

Well that's still a lot different from reading all of the entries, and indicates to me that no, we don't have a silly bug here to go fix.

An important implementation detail to keep in mind is the journal-file's use of mmap, relying heavily on the kernel's async read-ahead/around to achieve some semblance of decent performance.

The kernel doesn't have any knowledge of the journal-file layout/access patterns, and the layout isn't optimized to pack as much query-relevant data as possible contiguously for any given kernel read-ahead/around. So we pull in a lot of "chaff" in the async reads triggered by the "wheat" accesses.

If you tune the kernel to use smaller read-ahead/around sizes, you'll find the performance gets even worse, despite reading much less data from disk. This is because the userspace code accessing the data via mmap becomes blocked more often waiting on the granular faults, in a "death by a thousand papercuts" fashion, by becoming more synchronous with the kernel handling the faults.

I wonder if userfaultfd could be exploited to implement a layout-aware readahead that knows what query is being performed and where the data resides in the relevant journal files. Maybe if sd-journal could handle its own page faults, we could avoid reading so much "chaff", without having to significantly change either the on-disk layout or the mmap-reliant implementation. Some food for thought and experimentation perhaps...
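
For anyone who wants to measure that trade-off, a rough sketch (assuming the journal lives on /dev/sda; blockdev readahead values are in 512-byte sectors):

# blockdev --getra /dev/sda           # current readahead, e.g. 256 sectors = 128 KiB
# blockdev --setra 32 /dev/sda        # shrink readahead to 16 KiB
# echo 3 > /proc/sys/vm/drop_caches   # start cold
# time systemctl status nginx.service --no-pager
# blockdev --setra 256 /dev/sda       # restore the previous value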

XANi (Author) commented Feb 9, 2024

Well that's still a lot different from reading all of the entries, and indicates to me that no, we don't have a silly bug here to go fix.

Yes, it is not a "silly bug"; it's a fundamental fault of the log format and the way it is written and accessed. Just having a log file per service would cut that by orders of magnitude. Hell, even hashing the log file names ("service x goes into hash(servicename)[0..3]") would cut it by orders of magnitude.

Checking the status of a single service is the most common use for checking the logs, and it somehow also happens to be the worst-case scenario for this dreary design.

AbsurdlySuspicious commented Feb 10, 2024

As another point of comparison: I need to keep a lot of logs (tens of GB), and I regularly need to extract logs within a 24-48 hour span from about 2-4 weeks ago (from a single service). I had to set up syslog-ng, whose functionality I didn't really need, because extracting logs from journald in the scenario described above can take almost half an hour. This setup created a lot of journal files, but they were on a fast NVMe SSD.

So in syslog-ng I set up a MySQL backend and a text log per service, with logrotate as backup. I haven't even had to use the MySQL log yet, because searching the multiple rotated files of a single service's log (to find the start date of the range I need to extract) with ripgrep takes several seconds (!) instead of tens of minutes. Given that logrotate rotates logs every week and I keep them for at least a year, this is also a lot of files the kernel has no special knowledge of for efficient caching, and ripgrep doesn't even know the file lines are sorted by date, yet it still performs a whole lot better.

XANi (Author) commented Feb 14, 2024

I did some playing around and the results have been quite interesting.

The data journalctl returns:

# journalctl |dd of=/dev/zero bs=1M
0+1594 records in
0+1594 records out
6531134 bytes (6.5 MB, 6.2 MiB) copied, 0.615293 s, 10.6 MB/s
# journalctl |wc -l 
44464

6.5 MB worth of text, 44k lines. On disk it takes (the journal was freshly cleaned out before the test, but the ratios were the same when it got to 1 GB):

du -x /var/log/journal/ -h
57M	/var/log/journal/44cf6f547971fc33309d1e9e000002e7

Now, I assume binary log entries carry a bit more metadata, but I've seen anywhere between ~850 bytes and >1 KB per log entry by the time it got to the 1 GB allowed for /var/log/journal, which seems excessive.

So yeah, no wonder it is slow if each entry takes so much more space than the actual info in it. That's on top of having no way to look up a known service directly.

For comparison, just dumping it to an SQLite database:

 sqlite /tmp/log.sqlite 'select count(*) from log'
51834
[14:55:57]cthulhu:/tmp☠ ls -s /tmp/log.sqlite -h
11M /tmp/log.sqlite

(There are more log entries because it is fed from rsyslog, which does some remote logging for me.)
Adding an index bumps it to around 15 MB. I assume adding boot_id would bump it a bit more, but not to journalctl's level.
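
For the curious, a rough sketch of this kind of dump (the table layout and file names are my own invention; journalctl -o json, jq, and sqlite3 are standard tools):

$ journalctl -o json --no-pager | jq -r '[."__REALTIME_TIMESTAMP", (."_SYSTEMD_UNIT" // ""), (.MESSAGE|tostring)] | @csv' > /tmp/log.csv
$ sqlite3 /tmp/log.sqlite <<'EOF'
CREATE TABLE log (ts INTEGER, unit TEXT, message TEXT);
.mode csv
.import /tmp/log.csv log
CREATE INDEX log_unit_ts ON log (unit, ts);
EOF
$ sqlite3 /tmp/log.sqlite "SELECT message FROM log WHERE unit='nginx.service' ORDER BY ts DESC LIMIT 10;"

With the (unit, ts) index, a last-10-lines query like that touches only a few pages instead of scanning everything.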

baszoetekouw added a commit to SURFscz/SRAM-deploy that referenced this issue Feb 27, 2024
davispuh commented Mar 7, 2024

More log fun: whenever I try to query a service's status I get:

# journalctl -u service --since -10m
Too many journal files (limit is at 7168) in scope, ignoring file '/var/log/journal/873a5f55f2aa4b33b2568baca40e6a91/system@13e94865c50347e3a62de9c85946c18e-0000000000991ecf-0006131335742f28.journal'.

-- No entries --

It took forever to even tell me "No entries", and I know it's lying; there are for sure some entries...

Weltolk commented Aug 9, 2024

Same issue.

dtardon (Collaborator) commented Aug 12, 2024

It took forever to even tell me "No entries", and I know it's lying; there are for sure some entries...

It also told you why it couldn't produce meaningful output, but you chose to ignore that part:

Too many journal files (limit is at 7168) in scope, ignoring file '/var/log/journal/873a5f55f2aa4b33b2568baca40e6a91/system@13e94865c50347e3a62de9c85946c18e-0000000000991ecf-0006131335742f28.journal'.

If you have a custom journald config that produces that many files, you should rethink it. Running journalctl --vacuum-time=X / --vacuum-size=Y / --vacuum-files=Z may help for a time.
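
For example (the thresholds are arbitrary; the flags are real and can be combined):

# journalctl --vacuum-files=100       # keep at most 100 archived journal files
# journalctl --vacuum-size=2G         # reduce archived journals to below 2G
# journalctl --vacuum-time=2weeks     # drop archived entries older than two weeks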

XANi (Author) commented Aug 12, 2024

@dtardon Why can't it just shard by app name? The current approach is excessively wasteful in every respect. It's the equivalent of scanning the entire database any time you need a record by one of the most-used columns...

dtardon (Collaborator) commented Aug 13, 2024

Why can't it just shard by app name?

I don't understand what you mean by this...

The current approach is excessively wasteful in every respect. It's the equivalent of scanning the entire database any time you need a record by one of the most-used columns...

To get the last few records, it's not necessary to read all journal files in full. But all journal files have to be opened to figure out where these last few records are. Which can be a problem if there are too many of them, like in your case. (In fact, you have even more files than the sd-journal library allows, so some of them (well, one) are not opened, which necessarily leads to gaps in the log.)

XANi (Author) commented Aug 13, 2024

Why can't it just shard by app name?

I don't understand what you mean by this...

Write to a journal file named hash(service_name)[0,2]_system@bootid instead of system@bootid.

So if the service name hashes to da39a3ee5e6b4b0d3255bfef95601890afd80709, the file name would be da_system@bootid, and the tools would only need to open 1/256 of the total data when looking for a particular service.
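
A sketch of the proposed bucketing (sha1sum stands in for whatever hash journald might pick; as it happens, the digest quoted above is the SHA-1 of the empty string, which makes it easy to verify):

$ printf '%s' '' | sha1sum | cut -c1-2                # the example digest's shard prefix
da
$ printf '%s' 'nginx.service' | sha1sum | cut -c1-2   # prefix a real unit would land in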

That's just an ugly workaround, but really, anything is better than the current garbage implementation.

Hell, just save the name of the last file the service wrote a log entry to, together with the rest of the service's info.

To get the last few records, it's not necessary to read all journal files in full. But all journal files have to be opened to figure out where these last few records are.

When the last log entry is from a few days ago (or was rotated out), that still ends up reading the majority of the logs. For my 1 GB of logs, asking about an app whose logs were rotated out causes around 440 MB of reads on a simple systemctl status <service>.

Which can be a problem if there are too many of them there, like in your case.

Wrong person; I'm not the guy hitting the 7k limit. I have only 1 GB of journal and it still takes multiple seconds on my NAS. Because it is a NAS, all free RAM gets used for caching other stuff, so the moment I need any of the systemd commands that touch logs, none of the older logs are in the disk cache and it takes forever.

dtardon (Collaborator) commented Aug 14, 2024

Write to a journal file named hash(service_name)[0,2]_system@bootid instead of system@bootid.

So if the service name hashes to da39a3ee5e6b4b0d3255bfef95601890afd80709, the file name would be da_system@bootid, and the tools would only need to open 1/256 of the total data when looking for a particular service.

It would also produce 256 times more files, increasing disk usage (active journal files have a fixed minimal size, 4 MB IIRC) and making hitting the file limit more probable. That's not to mention that journald would have to keep 256 open file descriptors instead of 1 (FDs are a limited resource).

That's just an ugly workaround, but really, anything is better than the current garbage implementation.

There are more use cases for the journal than showing a few lines of log in systemctl status, you know. You propose to (possibly) improve the performance of systemctl status at the cost of making everything else worse.

@Self-Perfection

To get the last few records, it's not necessary to read all journal files in full. But all journal files have to be opened to figure out where these last few records are.

All files have to be opened only in the absolute worst case. In the most common scenario it should be enough to check the most recent journal file and find that it has enough entries. And the most recent file can be found just by ordering them by modification time, which needs a very small amount of I/O.

Reading from all journal files for each status report is extremely inefficient.
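
A sketch of that heuristic with existing tools (--file is a real journalctl option; whether newest-by-mtime equals newest-by-content is exactly what gets debated below):

$ newest=$(ls -t /var/log/journal/$(cat /etc/machine-id)/system*.journal | head -n 1)
$ journalctl --file "$newest" -u nginx.service -n 10 --no-pager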

dtardon (Collaborator) commented Aug 15, 2024

To get the last few records, it's not necessary to read all journal files in full. But all journal files have to be opened to figure out where these last few records are.

All files have to be opened only in the absolute worst case. In the most common scenario it should be enough to check the most recent journal file and find that it has enough entries. And the most recent file can be found just by ordering them by modification time, which needs a very small amount of I/O.

touch $(ls -tr /var/log/journal/$(</etc/machine-id)/system*.journal | head -n 1)

Suddenly the oldest archived journal file is the most recent... And that's just an artificial counter-example. There are completely legitimate time shifts (e.g., DST changes) that may cause the active journal file not to be the most recent one. The sd-journal library still tries to preserve correct ordering of logs in such cases, but it couldn't do that relying just on mtimes.

@Self-Perfection

Well, there should be ways to optimize this. At the very least, journalctl can safely assume that the currently active journal file is newer than all archived files. Thus, if the active journal file contains enough records to populate the status output, the archived files need not be accessed.

There are completely legitimate time shifts (e.g., DST changes) that may cause the active journal file not to be the most recent one.

I think any time shift affects the credibility of filesystem timestamps and of timestamps stored within the files in roughly the same way. If file timestamps seem unreliable, then let's store the last record's timestamp in the filenames of archived files.

@chutzimir

When reading logs, I expect to see them in the order they were produced. If there are time jumps during that period, I am OK with it.

I imagine old-school admins who are used to tailing /var/log/messages would expect the same thing.

There are completely legitimate time shifts (e.g., DST changes) that may cause the active journal file not to be the most recent one.

There is also a reasonable expectation that system time is monotonic. I cannot think of a "legitimate" time shift. DST is certainly not one, because the timestamps (mtimes) are recorded as Unix time, which is not affected by DST.

E.g., here is DST moving back while Unix time moves forward:

$ for t in 2024-11-03T{04..07}:{00,30}Z; do t=$(date +%s -d $t); echo "Unix time: $t, $(TZ=UTC date -d @$t), $(TZ=America/New_York date -d @$t)"; done
Unix time: 1730606400, Sun  3 Nov 04:00:00 UTC 2024, Sun  3 Nov 00:00:00 EDT 2024
Unix time: 1730608200, Sun  3 Nov 04:30:00 UTC 2024, Sun  3 Nov 00:30:00 EDT 2024
Unix time: 1730610000, Sun  3 Nov 05:00:00 UTC 2024, Sun  3 Nov 01:00:00 EDT 2024
Unix time: 1730611800, Sun  3 Nov 05:30:00 UTC 2024, Sun  3 Nov 01:30:00 EDT 2024
Unix time: 1730613600, Sun  3 Nov 06:00:00 UTC 2024, Sun  3 Nov 01:00:00 EST 2024
Unix time: 1730615400, Sun  3 Nov 06:30:00 UTC 2024, Sun  3 Nov 01:30:00 EST 2024
Unix time: 1730617200, Sun  3 Nov 07:00:00 UTC 2024, Sun  3 Nov 02:00:00 EST 2024
Unix time: 1730619000, Sun  3 Nov 07:30:00 UTC 2024, Sun  3 Nov 02:30:00 EST 2024
