
journalctl slowness #246

Closed
mbaldessari opened this issue Mar 31, 2014 · 16 comments

@mbaldessari
Contributor

We currently collect the following in the systemd plugin:
self.add_cmd_output("journalctl --verify")
self.add_cmd_output("journalctl --all --this-boot --no-pager")
self.add_cmd_output("journalctl --all --this-boot --no-pager -o verbose")

At least on my F20 system, which has a 4.1G /var/log/journal dir, this takes:
5mins for journalctl --verify
~2mins for journalctl --all --this-boot --no-pager > /tmp/foo
~2mins for journalctl --all --this-boot --no-pager -o verbose > /tmp/foo

Without the systemd plugin, the full sosreport run takes 1 min 40 seconds.
Two thoughts on this:

  1. Can we drop at least one of the two --all calls and keep only the -o verbose one?
  2. Maybe we should make --verify default to off, so that it has to be enabled explicitly?

If this has been discussed before and consensus was already reached, feel free to close this one out ;)
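
For illustration, here's a rough sketch of what (1) and (2) might look like in the plugin, written against the sos 3.x plugin API; the option name, default and wiring are just a suggestion, not existing code:

from sos.plugins import Plugin, RedHatPlugin

class Systemd(Plugin, RedHatPlugin):
    """systemd journal collection (illustrative sketch only)"""
    plugin_name = "systemd"

    # (name, description, speed hint, default) -- verification off by default
    option_list = [
        ("verify", "run the slow 'journalctl --verify' integrity check", "slow", False)
    ]

    def setup(self):
        # Keep a single verbose dump of the current boot instead of
        # collecting both the plain and the '-o verbose' variants.
        self.add_cmd_output("journalctl --all --this-boot --no-pager -o verbose")
        # Only verify journal integrity when explicitly requested.
        if self.get_option("verify"):
            self.add_cmd_output("journalctl --verify")

Anyone who actually needs the integrity check could then turn it on with something like "sosreport -k systemd.verify=on".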

@adam-stokes

@bmr-cymru, what do you think? Sounds OK to me, but I'm not far enough up to speed on systemd yet to make any suggestions.

@bmr-cymru
Member

I think there are a few things we need to understand here:

  • Is a 4.1G journal 'normal'?
  • Are these run times expected even if the journal is that size?

Even so, we may need to take steps to detect this and do something about it. I'm very surprised, though, that --this-boot is so slow even with a large journal. Surely it doesn't need to read $everything to show the latest records?

This doesn't feel like the Brave New World we were sold on with this newfangled logamathing. :-)

@bmr-cymru
Member

'journalctl is slow' seems to be a rather common meme...

@mbaldessari
Contributor Author

So I did a quick show of hands at work regarding the sizes of /var/log/journal. Here are the results:
1.4G
1.3G
1.2G
984M
146M
809M
820M
45M
54M
193M
67M

So I guess I definitely fall outside of the average. It might be because I never reinstall and always just yum update, so I've been continuously upgrading this box for about two years and three months. I'll try to profile these commands and raise BZs to get the systemd folks' opinion on this.

@bmr-cymru
Member

There are a few threads on systemd-devel discussing the problem but so far not even a comment from a maintainer:

http://lists.freedesktop.org/archives/systemd-devel/2013-September/013376.html
http://lists.freedesktop.org/archives/systemd-devel/2013-October/013498.html

@bmr-cymru bmr-cymru added this to the 3.2 milestone Apr 14, 2014
@bmr-cymru
Member

I added some simple command profiling in commit fd68a0c:

commit fd68a0ca2e65e22bcfcfe886676693e7a339e1e1
Author: Bryn M. Reeves <bmr@redhat.com>
Date:   Fri Jun 6 19:39:44 2014 +0100

    Record duration of calls to external programs at debug level

    The old profiling report was removed in favour of using external
    profilers like cProfile. This is useful for overall profiling and
    identifying code hotspots but is less helpful for drilling down
    to the collection of individual items.

    Since running programs is the dominant cost on typical hosts log
    the time taken to run each call and tag it with the binary name.

    This allows looking at the per-call and per-command cost across a
    run of sosreport.

    Signed-off-by: Bryn M. Reeves <bmr@redhat.com>
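
The gist of it is something like the following (a simplified sketch, not the actual sos code; the log line format here is illustrative):

import logging
import shlex
import subprocess
import time

log = logging.getLogger("sos")

def collect_cmd_output(plugin, cmdline):
    # Time the external call and log the duration at debug level,
    # tagged with the plugin and the binary name.
    start = time.time()
    output = subprocess.check_output(shlex.split(cmdline),
                                     stderr=subprocess.STDOUT)
    elapsed = time.time() - start
    binary = shlex.split(cmdline)[0]
    log.debug("[plugin:%s] collected output of '%s' in %.6fs"
              % (plugin, binary, elapsed))
    return output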

This allows you to easily see the most expensive commands sos is running, by command name or by plugin:

$ /tmp/soscmds.py 
Plugins:
[plugin:pcp] 0.200337s
[plugin:usb] 0.393647s
[plugin:sar] 0.456284s
[plugin:lvm2] 0.487352s
[plugin:kernel] 0.528694s
[plugin:systemtap] 0.535331s
[plugin:smartcard] 0.536921s
[plugin:general] 0.580074s
[plugin:firewalld] 0.692489s
[plugin:networking] 0.727666s
[plugin:samba] 0.737793s
[plugin:process] 0.926314s
[plugin:block] 1.009134s
[plugin:boot] 1.269422s
[plugin:tuned] 1.845853s
[plugin:yum] 3.362885s
[plugin:grub2] 3.593456s
[plugin:selinux] 5.527698s
[plugin:systemd] 34.034044s
[plugin:rpm] 50.551121s
Commands:
'pkcs11_inspect' 0.306482s
'nmcli' 0.354518s
'wbinfo' 0.362981s
'testparm' 0.374812s
'lsusb' 0.393647s
'sh' 0.488693s
'stap' 0.527244s
'parted' 0.570049s
'ls' 0.579363s
'firewall-cmd' 0.692489s
'ausearch' 0.737173s
'lsof' 0.796892s
'lsinitrd' 1.262425s
'tuned-adm' 1.845853s
'semodule' 1.916661s
'semanage' 2.580411s
'yum' 3.362885s
'grub2-mkconfig' 3.574695s
'journalctl' 33.615310s
'rpm' 50.350750s

See "Measuring command run times" for the scripts used to process the raw logs.
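
For anyone wanting to reproduce the numbers above without those scripts, here's a rough sketch of that kind of post-processing (it assumes debug lines in the illustrative format used in the sketch earlier; the real scripts may differ):

import re
import sys
from collections import defaultdict

# Assumed debug line format: [plugin:NAME] collected output of 'CMD' in SECSs
LINE_RE = re.compile(
    r"\[plugin:(?P<plugin>\S+)\].*'(?P<cmd>[^']+)'.* in (?P<secs>[\d.]+)s")

plugins = defaultdict(float)
commands = defaultdict(float)

with open(sys.argv[1]) as logfile:
    for line in logfile:
        m = LINE_RE.search(line)
        if not m:
            continue
        secs = float(m.group("secs"))
        plugins[m.group("plugin")] += secs
        # Tag the time against the binary name only.
        commands[m.group("cmd").split()[0]] += secs

print("Plugins:")
for name, secs in sorted(plugins.items(), key=lambda kv: kv[1]):
    print("[plugin:%s] %fs" % (name, secs))
print("Commands:")
for name, secs in sorted(commands.items(), key=lambda kv: kv[1]):
    print("'%s' %fs" % (name, secs))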

Obviously RPM is the bigger problem for most systems right now; there are some changes I've got planned to improve this somewhat.

But even on my 8-week-old F20 install I'm seeing quite long run times from journalctl (~35s typically). This is with around 600M in /var/log/journal:

# du -ch /var/log/journal/a38633d714b7431d82e8c98a62f81846/
617M    /var/log/journal/a38633d714b7431d82e8c98a62f81846/
617M    total

@portante
Contributor

portante commented Jun 9, 2014

On my FC20 system, "journalctl --verify" took well over a minute to run.

Is it necessary to run, given that we just want to collect data for post-processing? I have been disabling this plug-in for our uses for now; it's just too heavy for a summary report.

@bmr-cymru
Member

It's one of the pieces the systemd folks asked us to include:

https://bugzilla.redhat.com/show_bug.cgi?id=879619#c0

I've been keeping an eye on its cost ever since with a view to making this optional.

This and the additional package verification now done in the rpm plugin account for about 90% of the runtime of a default invocation on my test host.

One option I'm considering is to introduce a global '--verify' command line option. This can then be tested by individual plugins to determine whether or not to run costly verification actions.
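
Roughly, assuming the current optparse-based option handling and that plugins can read the global flag via get_option (names here are illustrative, not final code):

# in sosreport's global option parsing
parser.add_option("--verify", action="store_true", dest="verify",
                  default=False,
                  help="perform slow data verification (journal, rpm -V, ...)")

# in a plugin's setup(), gate the expensive call on the flag
if self.get_option("verify"):
    self.add_cmd_output("journalctl --verify")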

@portante
Contributor

portante commented Jun 9, 2014

I had almost 4 GB of journal files on my system, so a verify operation took about 10 minutes. I changed my settings so that I only keep 2 weeks of data, and a maximum of 200 MB.
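
For anyone wanting to do the same, those limits map to the standard keys in /etc/systemd/journald.conf, e.g.:

[Journal]
# cap persistent journal disk usage at 200 MB
SystemMaxUse=200M
# discard entries older than two weeks
MaxRetentionSec=2week

followed by a restart of systemd-journald to apply them.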

@bmr-cymru
Member

That's similar to the results Michele was seeing with a 4.1GB journal. In some ways we have a bit of a get-out here in that only Fedora is currently enabling the on-disk journal (so RHEL7 won't see this problem).

With the numbers I'm hearing though I'm more inclined to work towards having a --verify switch (default off) for 3.2.

@bmr-cymru bmr-cymru self-assigned this Jun 16, 2014
@bmr-cymru
Member

It would be great to get some feedback on how this behaves now, with and without the new --verify option.

@bmr-cymru bmr-cymru reopened this Jul 8, 2014
@bmr-cymru
Member

Ping, anyone had a chance to test this? I'll try to update the f21 / rawhide builds this week for anyone testing on Fedora or RHEL7.

@mbaldessari
Contributor Author

Hi Bryn,

Apologies for the late reply (moving countries is one big PITA). I like the --verify approach: those who need that level of assurance can ask for it, and the default case is much more reasonable. Here are my results:
[root@marquez sosreport]# echo 3 > /proc/sys/vm/drop_caches && du -hcs /var/log/journal/
4.1G /var/log/journal/
4.1G total
[root@marquez sosreport]# time ./sosreport --batch &> /tmp/sos1.log
real 1m58.353s
user 0m40.123s
sys 0m17.392s
[root@marquez sosreport]# echo 3 > /proc/sys/vm/drop_caches && du -hcs /var/log/journal/
4.1G /var/log/journal/
4.1G total
[root@marquez sosreport]# time ./sosreport --verify --batch &> /tmp/sos1.log
real 7m34.632s
user 5m13.151s
sys 1m1.205s

If you agree, I'd say we can close this one?

cheers,
Michele

@adam-stokes

2 minutes is still a long time to wait for sosreport to finish

@bmr-cymru
Member

We're improving; journalctl is no longer the worst offender in my tests but I'd like to see some more general improvement before we get 3.2 out.

@adam-stokes adam-stokes added high and removed medium labels Aug 19, 2014
@bmr-cymru
Member

I think the current state is reasonably acceptable here; the longer runtimes are only seen on systems with huge journals or when running with --verify. I'll close this out for now, but please re-open if anyone's seeing long journalctl delays elsewhere.
