Debug intermittent backup/recovery CI errors #3107
Comments
An offhanded idea regarding Regardless what the actual reason is for I assume kernel log messages on the console Therefore
By the way:
because in grep a word consists of |
kernel log messages on the console are useful for diagnostics and thus wanted when circumstances are not normal anymore. Here is the excerpt from
When debugging #2777 I tested what was needed to make the XFS error appear on the console, and took notes. It was necessary to set the console log level to at least 5. For this reason, I would like to raise the log level to at least 5 even when not in unattended mode. |
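A minimal sketch of the rule behind this observation, assuming the usual Linux behaviour that a kernel message reaches the console only when its priority number is lower than the console log level (the first field of /proc/sys/kernel/printk). The function names are hypothetical and not part of ReaR. Needing at least level 5 would be consistent with the XFS message being logged at warning priority (4), but that priority is an assumption here.

```shell
#!/bin/bash
# Hypothetical helpers for illustration, not part of ReaR.

# Print the current console log level (first field of /proc/sys/kernel/printk).
console_loglevel() {
    local level rest
    read -r level rest < /proc/sys/kernel/printk
    echo "$level"
}

# Succeed if a message with priority $1 appears on a console whose log level is $2.
# A message is printed only when its priority number is lower than the console level.
appears_on_console() {
    local msg_priority=$1 console_level=$2
    [ "$msg_priority" -lt "$console_level" ]
}

# A warning (priority 4) is hidden at 'dmesg -n 4' but shown at 'dmesg -n 5'.
appears_on_console 4 4 || echo "warning hidden at console level 4"
appears_on_console 4 5 && echo "warning shown at console level 5"
```

Calling `console_loglevel` before and after `dmesg -n 5` (as root) is one way to confirm that the setting took effect.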
I mean, raise from the current setting, the kernel default may be even higher. |
So dmesg log level 4 "err - error conditions" is insufficient |
Yes, it is clearly an error because the kernel could not do what was requested. Unfortunately, not much we can do about it. |
I like 'dmesg -n 5' very much. I think (according to what I get on my test system) |
I am thinking about to call |
verbose mode is set in ReaR, not on boot, though. I believe that the only appropriate option that you can set on the kernel command line before booting is |
Ah, you wrote "in sbin/rear". Then it should be removed from the startup script to avoid setting the console level at multiple places. |
I like 'dmesg -n 5' in [skel/default]/etc/scripts/boot With 'dmesg -n 5' in [skel/default]/etc/scripts/boot In contrast without any 'dmesg -n ...' in /etc/scripts/boot I guess that this was the reason behind for |
Regarding 'dmesg -n [4-7]' in sbin/rear: I like that too. I tested various manual 'dmesg -n [4-7]' before calling "rear recover" I.e. ReaR recovery system startup was with 'dmesg -n 5' and I think it is helpful for debugging problems |
For the fun of it: I tried with 'dmesg -n 8' before "rear recover" In particular many kernel messages during disk layout recreation. So 'dmesg -n 8' is over the top for "rear recover". I wonder what "debug - debug-level messages" in 'dmesg -h' means. |
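To make the tested levels easier to compare, here is a small sketch that lists which message levels reach the console for a given 'dmesg -n N'. It assumes the level names shown by 'dmesg -h' map to priorities 0 to 7 in the listed order; `levels_shown` is a hypothetical helper, not part of ReaR or util-linux.

```shell
#!/bin/bash
# Hypothetical helper for illustration, not part of ReaR or util-linux.
# Level names as listed by 'dmesg -h', indexed by priority number 0..7.
LEVEL_NAMES=(emerg alert crit err warn notice info debug)

# Print the level names that reach the console after 'dmesg -n $1'.
levels_shown() {
    local console_level=$1 shown=()
    local priority
    for priority in "${!LEVEL_NAMES[@]}"; do
        [ "$priority" -lt "$console_level" ] && shown+=("${LEVEL_NAMES[$priority]}")
    done
    echo "${shown[*]}"
}

levels_shown 5   # emerg alert crit err warn
levels_shown 8   # all eight levels, including debug
```

Under this reading, 'dmesg -n 8' is the only setting that also lets priority 7 ("debug") messages through, which would explain the flood of messages during disk layout recreation.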
@pcahyna
It seems about 2GB of memory are available. What does this ISO contain? A usual ReaR recovery system ISO is about 200MB and with
I get one with about 70MB on my test VM. Does your ISO perhaps also contain the backup? |
That's exactly it. There is no other place than RAM to store the backup so that it survives layout recreation. |
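Since the whole ISO (including the backup) has to fit into RAM in this test, a rough pre-check could look like the sketch below. `fits_in_memory` is a hypothetical helper, and the ISO path and the reserve value in the usage comment are placeholders, not values taken from the test setup.

```shell
#!/bin/bash
# Hypothetical helper for illustration, not part of ReaR or the test suite.

# Succeed if an image of $1 bytes fits into $2 kB of memory
# while leaving $3 kB (default 0) free for the running system.
fits_in_memory() {
    local image_bytes=$1 mem_kb=$2 reserve_kb=${3:-0}
    # Round the image size up to whole kB.
    local image_kb=$(( (image_bytes + 1023) / 1024 ))
    [ $(( image_kb + reserve_kb )) -le "$mem_kb" ]
}

# Usage sketch (path and reserve are placeholders):
# iso=/path/to/recovery.iso
# mem_kb=$(awk '/^MemAvailable:/ {print $2}' /proc/meminfo)
# fits_in_memory "$(stat -c %s "$iso")" "$mem_kb" 524288 || echo "ISO may not fit into RAM"
```

This only compares sizes; it does not account for the memory needed later during layout recreation and restore.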
In #3108, the test errored again, the VM just froze. I reran it and there is actually lots of useful output on console:
@jsmeix ok, I agree that it makes sense to set console log level both in system setup and |
I will implement setting the console log level |
@pcahyna For comparison:
so ISO plus backup below 1GB
In [skel/default]/etc/scripts/boot set dmesg -n 5 to limit console logging for 'dmesg' messages to level 5 so that kernel error and warning messages appear (intermixed with ReaR messages) on the console so that the user can notice when things go wrong in kernel area which helps to understand problems, see the related issue #3107
The last time the test succeeded seems to be here: https://github.com/rear/rear/runs/19517989303 . I suspect some kernel change; the log at that time shows 4.18.0-526.el8.x86_64. |
@jsmeix good question! Failed run has this:
https://artifacts.dev.testing-farm.io/f6bd055b-71dd-4cb7-9550-89007881c9fa/work-backup-and-restoretizyhdqu/tests/plans/backup-and-restore/execute/data/guest/default-0/make-backup-and-restore-iso-1/output.txt - hidden in a very deep directory unfortunately. Successful run had this:
https://artifacts.dev.testing-farm.io/df70fe97-ba51-466e-8472-ba72bead8bd9/ |
@pcahyna
e.g. here to show all files and directories For example on my above rather minimal SLES15-SP5 test VM:
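As a generic way to see what takes up space in a directory tree, something like the following sketch can be used. It is not the command from the comment above (which is not preserved here); `largest_entries` and the path in the usage comment are hypothetical.

```shell
#!/bin/bash
# Hypothetical helper for illustration, not part of ReaR.

# List the $2 (default 20) largest files and directories below $1,
# sizes in kB, largest first, without crossing file system boundaries.
largest_entries() {
    local top_dir=$1 count=${2:-20}
    du -a -k -x "$top_dir" 2>/dev/null | sort -rn | head -n "$count"
}

# Usage sketch (placeholder path to an unpacked recovery system):
# largest_entries /path/to/rootfs 30
```

Directories appear with their cumulative size, so a large firmware or kernel modules directory shows up near the top of the list.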
Regarding "suspect it may be some kernel change": In this case I think "the usual suspects" To avoid a needless big ReaR recovery system initrd
is basically mandatory. |
@jsmeix indeed that was it, thank you for the tips:
The size of the firmware increased a bit, but that should not be the reason. It seems that the test used to pass and now fails because previously no linux-firmware package was installed on the test machines and now there is one. And the
I will add |
@pcahyna So I think that additionally you need to Alternatively, it should work to not install (or uninstall) |
@jsmeix thanks a lot for the helpful suggestions, the test now passes as you can see in your latest PR (I retriggered it manually). |
See the failed testing-farm:centos-stream-8-x86_64 test in #3104: https://github.com/rear/rear/pull/3104/checks?check_run_id=19567464768
As console logging in the Testing Farm infrastructure is now enabled, we can look at the console output: https://artifacts.dev.testing-farm.io/a24b1484-7fc1-4e2a-8077-e72f9eea6cca/work-backup-and-restorefmaugnmx/console-68975dbe-0cbe-49ae-aa05-47d8fec8b66f.log . It shows this error:
It does not tell us what happened, but it at least shows where. I suspect that the fact that we don't see what happened is due to the kernel not printing its log messages, which is due to
rear/usr/share/rear/skel/default/etc/scripts/boot
Line 4 in b1e9f5e
that was added in 92d3a15 with the commit message
Added dmesg -n1 to rescue system
without explaining why.
Having kernel log messages on the console would be helpful for debugging filesystem-related errors like #2777 (see #3058 for an explanation: when mount fails, the mount command displays only the unhelpful message
wrong fs type, bad option, bad superblock on ..., missing codepage or helper program, or other error
, and the real cause of the error, like XFS (...): logbuf size must be greater than or equal to log stripe size
is printed by the kernel to the message log).
Anyway, since the error happens when loading the ISO into a ramdisk (the whole ISO gets loaded into a ramdisk in this test procedure), I suspect that we are out of memory at this point. The failed test shows
and a successful test ( CentOS-Stream-9 ) shows
so indeed, the successful test seems to have run on a machine with more memory than the failed test.
Originally posted by @pcahyna in #3104 (comment)