
journalctl slowness #246

Closed
mbaldessari opened this issue Mar 31, 2014 · 16 comments

@mbaldessari
Contributor

We currently collect the following in the systemd plugin:
self.add_cmd_output("journalctl --verify")
self.add_cmd_output("journalctl --all --this-boot --no-pager")
self.add_cmd_output("journalctl --all --this-boot --no-pager -o verbose")

At least on my F20 system, which has a 4.1G /var/log/journal dir, this takes:
5mins for journalctl --verify
~2mins for journalctl --all --this-boot --no-pager > /tmp/foo
~2mins for journalctl --all --this-boot --no-pager -o verbose > /tmp/foo

Without the systemd plugin, the full sosreport run takes 1 min 40 seconds.
Two thoughts on this:

  1. Can we drop at least one of the two --all calls and keep only the -o verbose one?
  2. Maybe we should make --verify default to off, so that it has to be enabled explicitly?

If this has been discussed before and consensus was already reached, feel free to close this one out ;)
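
For illustration, here's a rough sketch of what (1) and (2) might look like in the plugin, written against the sos 3.x plugin API; the option name, default and wiring are just a suggestion, not existing code:

from sos.plugins import Plugin, RedHatPlugin

class Systemd(Plugin, RedHatPlugin):
    """systemd journal collection (illustrative sketch only)"""
    plugin_name = "systemd"

    # (name, description, speed hint, default) -- verification off by default
    option_list = [
        ("verify", "run the slow 'journalctl --verify' integrity check", "slow", False)
    ]

    def setup(self):
        # Keep a single verbose dump of the current boot instead of
        # collecting both the plain and the '-o verbose' variants.
        self.add_cmd_output("journalctl --all --this-boot --no-pager -o verbose")
        # Only verify journal integrity when explicitly requested.
        if self.get_option("verify"):
            self.add_cmd_output("journalctl --verify")

Anyone who actually needs the integrity check could then turn it on with something like "sosreport -k systemd.verify=on".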

@adam-stokes

@bmr-cymru, what do you think? Sounds OK to me, but I'm not far enough up to speed on systemd yet to make any suggestions.

@bmr-cymru
Member

I think there are a few things we need to understand here:

  • Is a 4.1G journal 'normal'?
  • Are these run times expected even if the journal is that size?

Even so, we may need to take steps to detect this and do something about it. I'm very surprised, though, that --this-boot is so slow even with a large journal. Surely it doesn't need to read $everything to show the latest records?

This doesn't feel like the Brave New World we were sold on with this newfangled logamathing. :-)

@bmr-cymru
Member

'journalctl is slow' seems to be a rather common meme...

@mbaldessari
Contributor Author

So I did a quick show of hands at work regarding the sizes of /var/log/journal. Here are the results:
1.4G
1.3G
1.2G
984M
146M
809M
820M
45M
54M
193M
67M

So I guess I definitely fall outside of the average. It might be because I never reinstall and always just yum update, so I've been continuously upgrading this box for about two years and three months. I'll try to profile these commands and raise BZs to get the systemd folks' opinion on this.

@bmr-cymru
Member

There are a few threads on systemd-devel discussing the problem but so far not even a comment from a maintainer:

http://lists.freedesktop.org/archives/systemd-devel/2013-September/013376.html
http://lists.freedesktop.org/archives/systemd-devel/2013-October/013498.html

@bmr-cymru bmr-cymru added this to the 3.2 milestone Apr 14, 2014
@bmr-cymru
Member

I added some simple command profiling in commit fd68a0c:

commit fd68a0ca2e65e22bcfcfe886676693e7a339e1e1
Author: Bryn M. Reeves <bmr@redhat.com>
Date:   Fri Jun 6 19:39:44 2014 +0100

    Record duration of calls to external programs at debug level

    The old profiling report was removed in favour of using external
    profilers like cProfile. This is useful for overall profiling and
    identifying code hotspots but is less helpful for drilling down
    to the collection of individual items.

    Since running programs is the dominant cost on typical hosts log
    the time taken to run each call and tag it with the binary name.

    This allows looking at the per-call and per-command cost across a
    run of sosreport.

    Signed-off-by: Bryn M. Reeves <bmr@redhat.com>
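
The gist of it is something like the following (a simplified sketch, not the actual sos code; the log line format here is illustrative):

import logging
import shlex
import subprocess
import time

log = logging.getLogger("sos")

def collect_cmd_output(plugin, cmdline):
    # Time the external call and log the duration at debug level,
    # tagged with the plugin and the binary name.
    start = time.time()
    output = subprocess.check_output(shlex.split(cmdline),
                                     stderr=subprocess.STDOUT)
    elapsed = time.time() - start
    binary = shlex.split(cmdline)[0]
    log.debug("[plugin:%s] collected output of '%s' in %.6fs"
              % (plugin, binary, elapsed))
    return output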

This allows you to easily see the most expensive commands sos is running, by command name or by plugin:

$ /tmp/soscmds.py 
Plugins:
[plugin:pcp] 0.200337s
[plugin:usb] 0.393647s
[plugin:sar] 0.456284s
[plugin:lvm2] 0.487352s
[plugin:kernel] 0.528694s
[plugin:systemtap] 0.535331s
[plugin:smartcard] 0.536921s
[plugin:general] 0.580074s
[plugin:firewalld] 0.692489s
[plugin:networking] 0.727666s
[plugin:samba] 0.737793s
[plugin:process] 0.926314s
[plugin:block] 1.009134s
[plugin:boot] 1.269422s
[plugin:tuned] 1.845853s
[plugin:yum] 3.362885s
[plugin:grub2] 3.593456s
[plugin:selinux] 5.527698s
[plugin:systemd] 34.034044s
[plugin:rpm] 50.551121s
Commands:
'pkcs11_inspect' 0.306482s
'nmcli' 0.354518s
'wbinfo' 0.362981s
'testparm' 0.374812s
'lsusb' 0.393647s
'sh' 0.488693s
'stap' 0.527244s
'parted' 0.570049s
'ls' 0.579363s
'firewall-cmd' 0.692489s
'ausearch' 0.737173s
'lsof' 0.796892s
'lsinitrd' 1.262425s
'tuned-adm' 1.845853s
'semodule' 1.916661s
'semanage' 2.580411s
'yum' 3.362885s
'grub2-mkconfig' 3.574695s
'journalctl' 33.615310s
'rpm' 50.350750s

See "Measuring command run times" for the scripts used to process the raw logs.
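
For anyone wanting to reproduce the numbers above without those scripts, here's a rough sketch of that kind of post-processing (it assumes debug lines in the illustrative format used in the sketch earlier; the real scripts may differ):

import re
import sys
from collections import defaultdict

# Assumed debug line format: [plugin:NAME] collected output of 'CMD' in SECSs
LINE_RE = re.compile(
    r"\[plugin:(?P<plugin>\S+)\].*'(?P<cmd>[^']+)'.* in (?P<secs>[\d.]+)s")

plugins = defaultdict(float)
commands = defaultdict(float)

with open(sys.argv[1]) as logfile:
    for line in logfile:
        m = LINE_RE.search(line)
        if not m:
            continue
        secs = float(m.group("secs"))
        plugins[m.group("plugin")] += secs
        # Tag the time against the binary name only.
        commands[m.group("cmd").split()[0]] += secs

print("Plugins:")
for name, secs in sorted(plugins.items(), key=lambda kv: kv[1]):
    print("[plugin:%s] %fs" % (name, secs))
print("Commands:")
for name, secs in sorted(commands.items(), key=lambda kv: kv[1]):
    print("'%s' %fs" % (name, secs))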

Obviously RPM is the bigger problem for most systems right now; there are some changes I've got planned to improve this somewhat.

But even on my 8-week-old F20 install I'm seeing quite long run times from journalctl (~35s typically). This is with around 600M in /var/log/journal:

# du -ch /var/log/journal/a38633d714b7431d82e8c98a62f81846/
617M    /var/log/journal/a38633d714b7431d82e8c98a62f81846/
617M    total

@portante
Contributor

portante commented Jun 9, 2014

On my FC20 system, "journalctl --verify" took well over a minute to run.

Is it necessary to run, given that we just want to collect data for post-processing? I have been disabling this plug-in for our uses for now; it's just too heavy for a summary report.

@bmr-cymru
Member

It's one of the pieces the systemd folks asked us to include:

https://bugzilla.redhat.com/show_bug.cgi?id=879619#c0

I've been keeping an eye on its cost ever since with a view to making this optional.

This and the additional package verification now done in the rpm plugin account for about 90% of the runtime of a default invocation on my test host.

One option I'm considering is to introduce a global '--verify' command line option. This can then be tested by individual plugins to determine whether or not to run costly verification actions.
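
Roughly, assuming the current optparse-based option handling and that plugins can read the global flag via get_option (names here are illustrative, not final code):

# in sosreport's global option parsing
parser.add_option("--verify", action="store_true", dest="verify",
                  default=False,
                  help="perform slow data verification (journal, rpm -V, ...)")

# in a plugin's setup(), gate the expensive call on the flag
if self.get_option("verify"):
    self.add_cmd_output("journalctl --verify")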

@portante
Contributor

portante commented Jun 9, 2014

I had almost 4 GB of journal files on my system, so a verify operation took about 10 minutes. I changed my settings so that I only keep 2 weeks of data, and a maximum of 200 MB.
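
For anyone wanting to do the same, those limits map to the standard keys in /etc/systemd/journald.conf, e.g.:

[Journal]
# cap persistent journal disk usage at 200 MB
SystemMaxUse=200M
# discard entries older than two weeks
MaxRetentionSec=2week

followed by a restart of systemd-journald to apply them.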

@bmr-cymru
Member

That's similar to the results Michele was seeing with a 4.1GB journal. In some ways we have a bit of a get-out here in that only Fedora is currently enabling the on-disk journal (so RHEL7 won't see this problem).

With the numbers I'm hearing though I'm more inclined to work towards having a --verify switch (default off) for 3.2.

@bmr-cymru bmr-cymru self-assigned this Jun 16, 2014
@bmr-cymru
Member

It would be great to get some feedback on how this behaves now, with and without the new --verify option.

@bmr-cymru bmr-cymru reopened this Jul 8, 2014
@bmr-cymru
Member

Ping, anyone had a chance to test this? I'll try to update the f21 / rawhide builds this week for anyone testing on Fedora or RHEL7.

@mbaldessari
Contributor Author

Hi Bryn,

Apologies for the late reply (moving countries is one big PITA). I like the --verify approach: those who need that level of assurance can ask for it, and the default case is much more reasonable. Here are my results:
[root@marquez sosreport]# echo 3 > /proc/sys/vm/drop_caches && du -hcs /var/log/journal/
4.1G /var/log/journal/
4.1G total
[root@marquez sosreport]# time ./sosreport --batch &> /tmp/sos1.log
real 1m58.353s
user 0m40.123s
sys 0m17.392s
[root@marquez sosreport]# echo 3 > /proc/sys/vm/drop_caches && du -hcs /var/log/journal/
4.1G /var/log/journal/
4.1G total
[root@marquez sosreport]# time ./sosreport --verify --batch &> /tmp/sos1.log
real 7m34.632s
user 5m13.151s
sys 1m1.205s

If you agree, I'd say we can close this one?

cheers,
Michele

@adam-stokes

2 minutes is still a long time to wait for sosreport to finish

@bmr-cymru
Member

We're improving; journalctl is no longer the worst offender in my tests but I'd like to see some more general improvement before we get 3.2 out.

@adam-stokes adam-stokes added high and removed medium labels Aug 19, 2014
@bmr-cymru
Member

I think the current state is reasonably acceptable here; the longer runtimes are only seen on systems with huge journals or when running with --verify. I'll close this out for now, but please re-open if anyone's seeing long journalctl delays elsewhere.
