Debug intermittent backup/recovery CI errors #3107

pcahyna · 2023-12-13T15:57:34Z

See the failed testing-farm:centos-stream-8-x86_64 test in #3104 : https://github.com/rear/rear/pull/3104/checks?check_run_id=19567464768
As console log in the Testing Farm infrastructure is now enabled, we can look at the console output: https://artifacts.dev.testing-farm.io/a24b1484-7fc1-4e2a-8077-e72f9eea6cca/work-backup-and-restorefmaugnmx/console-68975dbe-0cbe-49ae-aa05-47d8fec8b66f.log . It show this error:

Running workflow recover within the ReaR rescue/recovery system
Running PRE_RECOVERY_SCRIPT 'mkdir /tmp/mnt; mount /dev/nvme0n1p1 /tmp/mnt/; modprobe brd rd_nr=1 rd_size=2097152; dd if=/tmp/mnt//var/lib/rear/output/rear-ip-172-31-16-70.iso of=/dev/ram0; umount /tmp/mnt/;'
/etc/scripts/system-setup: line 255:   545 Killed                  rear $rear_options recover
'eval rear $rear_options recover' results exit code 137
'rear recover' failed with exit code 137
1) View Relax-and-Recover log file(s)
2) Login at the rescue system

It does not tell us what happened, but it at least shows where. I suspect that the fact that we don't see what happened is due to the kernel not printing its log messages, which is due to

rear/usr/share/rear/skel/default/etc/scripts/boot

Line 4 in b1e9f5e

dmesg -n1

that was added 92d3a15 with the commit message Added dmesg -n1 to rescue system without explaining why.
Having kernel log messages on the console would be helpful for debugging filesystem-related errors like #2777 (see #3058 for explanation : when mount fails, the mount commands display only the unhelpful message wrong fs type, bad option, bad superblock on ..., missing codepage or helper program, or other error, and the real cause of the error like XFS (...): logbuf size must be greater than or equal to log stripe size is printed by the kernel to the message log).

Anyway, since the error happens when loading the ISO into a ramdisk (the whole ISO gets loaded into a ramdisk in this test procedure), I suspect that we are out of memory at this point. The failed test shows

Memory: 2052384K/3503836K available

and a successful test ( CentOS-Stream-9 ) shows

Memory: 2931284K/3942236K available

so indeed, successful test seems to have run on a machine with more memory than the failed test.

Originally posted by @pcahyna in #3104 (comment)

The text was updated successfully, but these errors were encountered:

jsmeix · 2023-12-14T09:15:21Z

An offhanded idea regarding dmesg -n1:

Regardless what the actual reason is for dmesg -n1
it seems that at least "under normal circumstances"
kernel log messages on the console are not wanted.

I assume kernel log messages on the console
do not matter when "rear recover" is run
in unattended mode because in unattended mode
there is noone who actively works on the console
so it does not matter which and how many messages
appear on the console.

Therefore dmesg -n1 may only be set when
"rear recover" is not run in unattended mode
similar as in [skel/default]/etc/scripts/system-setup
e.g. like (simpler code for this case here)

# disable console logging for kernel messages
# when not in unattended mode:
cat /proc/cmdline | grep -qw unattended || dmesg -n1

By the way:
A simple grep -w WORD may not work sufficiently fail-safe

# echo 'foo-unattended' | grep -w unattended && echo y || echo n
foo-unattended
y

# echo 'foo_unattended' | grep -w unattended && echo y || echo n
n

because in grep a word consists of
alphanumeric characters and the underscore
so e.g. '-' is a word separator in grep
in contrast to bash where $IFS are word separators
which is the reason for my more (and hopefully sufficiently)
elaborated code in [skel/default]/etc/scripts/system-setup
by using an artificial bash array.

pcahyna · 2023-12-14T10:13:22Z

@jsmeix

Regardless what the actual reason is for dmesg -n1
it seems that at least "under normal circumstances"
kernel log messages on the console are not wanted.

kernel log messages on the console are useful for diagnostics and thus wanted when circumstances are not normal anymore. Here is the excerpt from dmesg -h that explains the log levels (can be used as arguments to -n ):

Supported log levels (priorities):
   emerg - system is unusable
   alert - action must be taken immediately
    crit - critical conditions
     err - error conditions
    warn - warning conditions
  notice - normal but significant condition
    info - informational
   debug - debug-level messages

When debugging #2777 I tested what was needed to make the XFS error appear on the console, and took notes. It was necessary to set console the log level to at least 5. For this reason, I would like to raise the log level to at least 5 even if not in the unattended mode.

pcahyna · 2023-12-14T10:27:21Z

I mean, raise from the current setting, the kernel default may be even higher.

jsmeix · 2023-12-14T10:52:53Z

So dmesg log level 4 "err - error conditions" is insufficient
to make the XFS error appear on the console and you needed
dmesg log level 5 "warn - warning conditions".
Sigh, see
https://schlomo.schapiro.org/2015/04/warning-is-waste-of-my-time.html

pcahyna · 2023-12-14T11:37:09Z

Yes, it is clearly an error because the kernel could not do what was requested. Unfortunately, not much we can do about it.

jsmeix · 2023-12-14T12:04:57Z

I like 'dmesg -n 5' very much.
I tested "rear recover" up to 'dmesg -n 7'
see #3108

I think (according to what I get on my test system)
'dmesg -n 5' provides really a very good balance
between helpfulness to be able to diagnose problems
and not disturbing the user with needless information.

jsmeix · 2023-12-14T12:08:03Z

I am thinking about to call dmesg -n [4-7] in sbin/rear
depending on verbose and debug modes for ReaR e.g. like
dmesg -n 4 as minimum i.e. in non verbose mode
dmesg -n 5 in verbose mode (matches the "rear recover" default)
dmesg -n 6 in debug mode
dmesg -n 7 in debugscript mode
but only inside the recovery system
so no change what kernel messages appear on the console
when ReaR runs on the original system.

pcahyna · 2023-12-14T12:17:56Z

verbose mode is set in ReaR, not on boot, though. I believe that the only appropriate option that you can set on the kernel command line before booting is debug (and perhaps unattended and auto_recover/automatic if you want to set the log level according to these).

pcahyna · 2023-12-14T12:21:18Z

Ah, you wrote "in sbin/rear". Then it should be removed from the startup script to avoid setting the console level at multiple places.

jsmeix · 2023-12-14T12:42:51Z

I like 'dmesg -n 5' in [skel/default]/etc/scripts/boot
to get kernel errors and warnings during
ReaR recovery system startup.

With 'dmesg -n 5' in [skel/default]/etc/scripts/boot
I do not get any kernel error or warning message
during ReaR recovery system startup on my test VM
so normally 'dmesg -n 5' is sufficiently quiet.

In contrast without any 'dmesg -n ...' in /etc/scripts/boot
I get far too much kernel messages
during ReaR recovery system startup on my test VM
that disturb the ReaR recovery system startup messages
and in particular those needless kernel messages make
the intended ReaR recovery system startup messages
scroll away out of sight too soon on console.

I guess that this was the reason behind for
the 'dmesg -n1' in /etc/scripts/boot
but I think 'dmesg -n1' is much too hard and
makes the whole kernel messages basically useless
because I think it is not helpful when the user can
only see an emergency message when the system is unusable.

jsmeix · 2023-12-14T12:52:27Z

Regarding 'dmesg -n [4-7]' in sbin/rear:

I like that too.

I tested various manual 'dmesg -n [4-7]' before calling "rear recover"
and got different amounts of kernel messages on the console
in particular during disk layout recreation.

I.e. ReaR recovery system startup was with 'dmesg -n 5' and
then I set a different 'dmesg -n [4-7]' before "rear recover".

I think it is helpful for debugging problems
when debug and debugscript mode for "rear recover"
automatically increases the level of the kernel messages.

jsmeix · 2023-12-14T13:14:42Z

For the fun of it:

I tried with 'dmesg -n 8' before "rear recover"
which shows way too many kernel messages - as expected.

In particular many kernel messages during disk layout recreation.
Several screens full of kernel messages so most of it
gets lost out of sight by fast scrolling up on console.
On my test VM I have only one disk with a few partitions.

So 'dmesg -n 8' is over the top for "rear recover".

I wonder what "debug - debug-level messages" in 'dmesg -h' means.
Is dmesg log level 8 meant for debugging the kernel itself
or meant for debugging issues outside of the kernel?

jsmeix · 2023-12-14T14:02:49Z

@pcahyna
regarding
#3107 (comment)

the whole ISO gets loaded into a ramdisk
...
suspect that we are out of memory
...
Memory: 2052384K/3503836K available

It seems about 2GB of memory are available.

What does this ISO contain?

A usual ReaR recovery system ISO is about 200MB and with

FIRMWARE_FILES=( 'no' )
MODULES=( 'loaded_modules' )

I get one with about 70MB on my test VM.

Does your ISO perhaps also contain the backup?

pcahyna · 2023-12-14T14:14:41Z

Does your ISO perhaps also contain the backup?

That's exactly it. There is no other place than RAM to store the backup so that it survives layout recreation.

pcahyna · 2023-12-14T14:28:53Z

In #3108, the test errored again, the VM just froze. I reran it and there is actually lots of useful output on console:

Running PRE_RECOVERY_SCRIPT 'mkdir /tmp/mnt; mount /dev/nvme0n1p1 /tmp/mnt/; modprobe brd rd_nr=1 rd_size=2097152; dd if=/tmp/mnt//var/lib/rear/output/rear-ip-172-31-27-161.iso of=/dev/ram0; umount /tmp/mnt/;'
[   36.536328] dd invoked oom-killer: gfp_mask=0x6201c0(GFP_USER|__GFP_WRITE), order=0, oom_score_adj=0
[   36.537813] CPU: 0 PID: 1688 Comm: dd Not tainted 4.18.0-527.el8.x86_64 #1
[   36.538915] Hardware name: Amazon EC2 c6a.large/, BIOS 1.0 10/16/2017
[   36.539946] Call Trace:
[   36.540354]  dump_stack+0x41/0x60
[   36.540897]  dump_header+0x4a/0x1df
[   36.541466]  oom_kill_process.cold.33+0xb/0x10
[   36.542183]  out_of_memory+0x1bd/0x4e0
[   36.542792]  __alloc_pages_slowpath+0xbf0/0xcd0
[   36.543523]  __alloc_pages_nodemask+0x2e2/0x330
[   36.544261]  pagecache_get_page+0xce/0x310
[   36.544926]  ? blkdev_direct_IO+0x490/0x490
[   36.545604]  grab_cache_page_write_begin+0x1f/0x40
[   36.546374]  block_write_begin+0x24/0xd0
[   36.547011]  generic_perform_write+0xf2/0x1b0
[   36.547714]  ? file_update_time+0x62/0x130
[   36.548380]  __generic_file_write_iter+0x102/0x1d0
[   36.549149]  blkdev_write_iter+0xb8/0x160
[   36.549797]  ? __indirect_thunk_end+0x1e67de/0x1e67de
[   36.550607]  new_sync_write+0x112/0x160
[   36.551231]  vfs_write+0xa5/0x1b0
[   36.551778]  ksys_write+0x4f/0xb0
[   36.552323]  do_syscall_64+0x5b/0x1b0
[   36.552919]  entry_SYSCALL_64_after_hwframe+0x61/0xc6
[   36.553725] RIP: 0033:0x7fcff7bf4de8
[   36.554307] Code: 89 02 48 c7 c0 ff ff ff ff eb b3 0f 1f 80 00 00 00 00 f3 0f 1e fa 48 8d 05 55 49 2a 00 8b 00 85 c0 75 17 b8 01 00 00 00 0f 05 <48> 3d 00 f0 ff ff 77 58 c3 0f 1f 80 00 00 00 00 41 54 49 89 d4 55
[   36.557238] RSP: 002b:00007fff2fad50c8 EFLAGS: 00000246 ORIG_RAX: 0000000000000001
[   36.558436] RAX: ffffffffffffffda RBX: 0000000000000000 RCX: 00007fcff7bf4de8
[   36.559564] RDX: 0000000000000200 RSI: 0000563f558ca000 RDI: 0000000000000001
[   36.560697] RBP: 0000000000000200 R08: 0000000000000000 R09: 0000563f558ca060
[   36.561824] R10: 0000000000000000 R11: 0000000000000246 R12: 0000563f558ca000
[   36.562950] R13: 0000000000000800 R14: 0000000000000200 R15: 0000563f558c8200
[   36.564101] Mem-Info:
[   36.564556] active_anon:29 inactive_anon:4913 isolated_anon:0
[   36.564556]  active_file:0 inactive_file:0 isolated_file:0
[   36.564556]  unevictable:281131 dirty:0 writeback:0
[   36.564556]  slab_reclaimable:8446 slab_unreclaimable:3122
[   36.564556]  mapped:3080 shmem:2099 pagetables:234 bounce:0
[   36.564556]  free:20552 free_pcp:0 free_cma:0
[   36.569507] Node 0 active_anon:116kB inactive_anon:19652kB active_file:0kB inactive_file:0kB unevictable:1124524kB isolated(anon):0kB isolated(file):0kB mapped:12320kB dirty:0kB writeback:0kB shmem:8396kB shmem_thp: 0kB shmem_pmdmapped: 0kB anon_thp: 0kB writeback_tmp:0kB kernel_stack:1280kB pagetables:936kB all_unreclaimable? yes
[   36.574128] Node 0 DMA free:11060kB min:376kB low:468kB high:560kB active_anon:0kB inactive_anon:0kB active_file:0kB inactive_file:0kB unevictable:0kB writepending:0kB present:15992kB managed:15360kB mlocked:0kB bounce:0kB free_pcp:0kB local_pcp:0kB free_cma:0kB
[   36.577808] lowmem_reserve[]: 0 1809 2671 2671 2671
[   36.578617] Node 0 DMA32 free:49096kB min:45508kB low:56884kB high:68260kB active_anon:116kB inactive_anon:19652kB active_file:0kB inactive_file:0kB unevictable:304596kB writepending:0kB present:2543716kB managed:2433488kB mlocked:0kB bounce:0kB free_pcp:0kB local_pcp:0kB free_cma:0kB
[   36.582586] lowmem_reserve[]: 0 0 862 862 862
[   36.583304] Node 0 Normal free:22052kB min:21692kB low:27112kB high:32532kB active_anon:0kB inactive_anon:0kB active_file:0kB inactive_file:0kB unevictable:819928kB writepending:0kB present:944128kB managed:883200kB mlocked:0kB bounce:0kB free_pcp:0kB local_pcp:0kB free_cma:0kB
[   36.587151] lowmem_reserve[]: 0 0 0 0 0
[   36.587776] Node 0 DMA: 5*4kB (UE) 2*8kB (UE) 1*16kB (E) 2*32kB (UE) 1*64kB (E) 1*128kB (E) 2*256kB (UE) 2*512kB (UE) 1*1024kB (E) 2*2048kB (ME) 1*4096kB (M) = 11060kB
[   36.590146] Node 0 DMA32: 300*4kB (UME) 385*8kB (UME) 241*16kB (UME) 182*32kB (UME) 89*64kB (UME) 34*128kB (ME) 14*256kB (UME) 16*512kB (UME) 5*1024kB (M) 4*2048kB (M) 0*4096kB = 49096kB
[   36.592782] Node 0 Normal: 3*4kB (UE) 3*8kB (UME) 4*16kB (UM) 4*32kB (UME) 3*64kB (UME) 3*128kB (UME) 3*256kB (UME) 2*512kB (ME) 1*1024kB (E) 1*2048kB (E) 4*4096kB (M) = 22052kB
[   36.595286] 283230 total pagecache pages
[   36.595932] 0 pages in swap cache
[   36.596602] Swap cache stats: add 0, delete 0, find 0/0
[   36.597455] Free swap  = 0kB
[   36.597926] Total swap = 0kB
[   36.598419] 875959 pages RAM
[   36.598897] 0 pages HighMem/MovableOnly
[   36.599527] 42947 pages reserved
[   36.600079] 0 pages hwpoisoned
[   36.600627] Out of memory: Killed process 193 (systemd-journal) total-vm:87436kB, anon-rss:836kB, file-rss:6388kB, shmem-rss:872kB, UID:0 pgtables:196kB oom_score_adj:0
[   36.603686] systemd[1]: systemd-journald.service: Main process exited, code=killed, status=9/KILL
[   36.605604] systemd[1]: systemd-journald.service: Failed with result 'signal'.
[   36.643294] dd invoked oom-killer: gfp_mask=0x6201c0(GFP_USER|__GFP_WRITE), order=0, oom_score_adj=0
[   36.644768] CPU: 0 PID: 1688 Comm: dd Not tainted 4.18.0-527.el8.x86_64 #1
[   36.645868] Hardware name: Amazon EC2 c6a.large/, BIOS 1.0 10/16/2017
[   36.646882] Call Trace:
[   36.647305]  dump_stack+0x41/0x60
[   36.647846]  dump_header+0x4a/0x1df
[   36.648417]  oom_kill_process.cold.33+0xb/0x10
[   36.649143]  out_of_memory+0x1bd/0x4e0
[   36.649744]  __alloc_pages_slowpath+0xbf0/0xcd0
[   36.650472]  __alloc_pages_nodemask+0x2e2/0x330
[   36.651214]  pagecache_get_page+0xce/0x310
[   36.651877]  ? blkdev_direct_IO+0x490/0x490
[   36.652569]  grab_cache_page_write_begin+0x1f/0x40
[   36.653346]  block_write_begin+0x24/0xd0
[   36.653974]  generic_perform_write+0xf2/0x1b0
[   36.654669]  ? file_update_time+0x62/0x130
[   36.655344]  __generic_file_write_iter+0x102/0x1d0
[   36.656112]  blkdev_write_iter+0xb8/0x160
[   36.656757]  ? __indirect_thunk_end+0x1e67de/0x1e67de
[   36.657576]  new_sync_write+0x112/0x160
[   36.658194]  vfs_write+0xa5/0x1b0
[   36.658734]  ksys_write+0x4f/0xb0
[   36.659287]  do_syscall_64+0x5b/0x1b0
[   36.659888]  entry_SYSCALL_64_after_hwframe+0x61/0xc6
[   36.660694] RIP: 0033:0x7fcff7bf4de8
[   36.661287] Code: 89 02 48 c7 c0 ff ff ff ff eb b3 0f 1f 80 00 00 00 00 f3 0f 1e fa 48 8d 05 55 49 2a 00 8b 00 85 c0 75 17 b8 01 00 00 00 0f 05 <48> 3d 00 f0 ff ff 77 58 c3 0f 1f 80 00 00 00 00 41 54 49 89 d4 55
[   36.664222] RSP: 002b:00007fff2fad50c8 EFLAGS: 00000246 ORIG_RAX: 0000000000000001
[   36.665425] RAX: ffffffffffffffda RBX: 0000000000000000 RCX: 00007fcff7bf4de8
[   36.666545] RDX: 0000000000000200 RSI: 0000563f558ca000 RDI: 0000000000000001
[   36.667682] RBP: 0000000000000200 R08: 0000000000000000 R09: 0000563f558ca060
[   36.668809] R10: 0000000000000000 R11: 0000000000000246 R12: 0000563f558ca000
[   36.669946] R13: 0000000000000800 R14: 0000000000000200 R15: 0000563f558c8200
[   36.671104] Mem-Info:
[   36.671480] active_anon:29 inactive_anon:4704 isolated_anon:0
[   36.671480]  active_file:1 inactive_file:11 isolated_file:0
[   36.671480]  unevictable:281131 dirty:1 writeback:0
[   36.671480]  slab_reclaimable:8439 slab_unreclaimable:3106
[   36.671480]  mapped:2808 shmem:2099 pagetables:187 bounce:0
[   36.671480]  free:20412 free_pcp:0 free_cma:0
[   36.676395] Node 0 active_anon:116kB inactive_anon:18816kB active_file:4kB inactive_file:44kB unevictable:1124524kB isolated(anon):0kB isolated(file):0kB mapped:11232kB dirty:4kB writeback:0kB shmem:8396kB shmem_thp: 0kB shmem_pmdmapped: 0kB anon_thp: 0kB writeback_tmp:0kB kernel_stack:1264kB pagetables:748kB all_unreclaimable? yes
[   36.680976] Node 0 DMA free:11060kB min:376kB low:468kB high:560kB active_anon:0kB inactive_anon:0kB active_file:0kB inactive_file:0kB unevictable:0kB writepending:0kB present:15992kB managed:15360kB mlocked:0kB bounce:0kB free_pcp:0kB local_pcp:0kB free_cma:0kB
[   36.684616] lowmem_reserve[]: 0 1809 2671 2671 2671
[   36.685422] Node 0 DMA32 free:48924kB min:45508kB low:56884kB high:68260kB active_anon:116kB inactive_anon:18816kB active_file:0kB inactive_file:0kB unevictable:304596kB writepending:0kB present:2543716kB managed:2433488kB mlocked:0kB bounce:0kB free_pcp:0kB local_pcp:0kB free_cma:0kB
[   36.689355] lowmem_reserve[]: 0 0 862 862 862
[   36.690064] Node 0 Normal free:21664kB min:21692kB low:27112kB high:32532kB active_anon:0kB inactive_anon:0kB active_file:4kB inactive_file:44kB unevictable:819928kB writepending:4kB present:944128kB managed:883200kB mlocked:0kB bounce:0kB free_pcp:0kB local_pcp:0kB free_cma:0kB
[   36.693932] lowmem_reserve[]: 0 0 0 0 0
[   36.694563] Node 0 DMA: 5*4kB (UE) 2*8kB (UE) 1*16kB (E) 2*32kB (UE) 1*64kB (E) 1*128kB (E) 2*256kB (UE) 2*512kB (UE) 1*1024kB (E) 2*2048kB (ME) 1*4096kB (M) = 11060kB
[   36.697068] Node 0 DMA32: 328*4kB (UME) 398*8kB (UME) 246*16kB (UME) 187*32kB (UME) 93*64kB (UME) 37*128kB (UME) 13*256kB (ME) 15*512kB (ME) 5*1024kB (M) 4*2048kB (M) 0*4096kB = 49424kB
[   36.699697] Node 0 Normal: 1*4kB (E) 5*8kB (UME) 4*16kB (UM) 3*32kB (UME) 2*64kB (ME) 3*128kB (UME) 2*256kB (ME) 2*512kB (ME) 1*1024kB (E) 1*2048kB (E) 4*4096kB (M) = 21708kB
[   36.702181] 283242 total pagecache pages
[   36.702813] 0 pages in swap cache
[   36.703378] Swap cache stats: add 0, delete 0, find 0/0
[   36.704231] Free swap  = 0kB
[   36.704709] Total swap = 0kB
[   36.705201] 875959 pages RAM
[   36.705669] 0 pages HighMem/MovableOnly
[   36.706307] 42947 pages reserved
[   36.706828] 0 pages hwpoisoned
[   36.707383] Out of memory: Killed process 545 (rear) total-vm:15912kB, anon-rss:4212kB, file-rss:2676kB, shmem-rss:0kB, UID:0 pgtables:72kB oom_score_adj:0
/etc/scripts/system-setup: line 255:   545 Killed                  rear $rear_options recover
'eval rear $rear_options recover' results exit code 137
'rear recover' failed with exit code 137
1) View Relax-and-Recover log file(s)
2) Login at the rescue system
Select what to do [   36.849846] dd invoked oom-killer: gfp_mask=0x6201c0(GFP_USER|__GFP_WRITE), order=0, oom_score_adj=0
[   36.851330] CPU: 1 PID: 1688 Comm: dd Not tainted 4.18.0-527.el8.x86_64 #1
[   36.852424] Hardware name: Amazon EC2 c6a.large/, BIOS 1.0 10/16/2017
[   36.853452] Call Trace:
[   36.853857]  dump_stack+0x41/0x60
[   36.854401]  dump_header+0x4a/0x1df
[   36.854967]  oom_kill_process.cold.33+0xb/0x10
[   36.855682]  out_of_memory+0x1bd/0x4e0
[   36.856298]  __alloc_pages_slowpath+0xbf0/0xcd0
[   36.857029]  __alloc_pages_nodemask+0x2e2/0x330
[   36.857752]  pagecache_get_page+0xce/0x310
[   36.858409]  ? blkdev_direct_IO+0x490/0x490
[   36.859085]  grab_cache_page_write_begin+0x1f/0x40
[   36.859848]  block_write_begin+0x24/0xd0
[   36.860489]  generic_perform_write+0xf2/0x1b0
[   36.861186]  ? generic_update_time+0xba/0xd0
[   36.861871]  ? file_update_time+0xf1/0x130
[   36.862530]  __generic_file_write_iter+0x102/0x1d0
[   36.863295]  blkdev_write_iter+0xb8/0x160
[   36.863943]  ? __indirect_thunk_end+0x1e67de/0x1e67de
[   36.864751]  new_sync_write+0x112/0x160
[   36.865375]  vfs_write+0xa5/0x1b0
[   36.865913]  ksys_write+0x4f/0xb0
[   36.866453]  do_syscall_64+0x5b/0x1b0
[   36.867051]  entry_SYSCALL_64_after_hwframe+0x61/0xc6
[   36.867858] RIP: 0033:0x7fcff7bf4de8
[   36.868437] Code: 89 02 48 c7 c0 ff ff ff ff eb b3 0f 1f 80 00 00 00 00 f3 0f 1e fa 48 8d 05 55 49 2a 00 8b 00 85 c0 75 17 b8 01 00 00 00 0f 05 <48> 3d 00 f0 ff ff 77 58 c3 0f 1f 80 00 00 00 00 41 54 49 89 d4 55
[   36.871358] RSP: 002b:00007fff2fad50c8 EFLAGS: 00000246 ORIG_RAX: 0000000000000001
[   36.872566] RAX: ffffffffffffffda RBX: 0000000000000000 RCX: 00007fcff7bf4de8
[   36.873694] RDX: 0000000000000200 RSI: 0000563f558ca000 RDI: 0000000000000001
[   36.874821] RBP: 0000000000000200 R08: 0000000000000000 R09: 0000563f558ca060
[   36.875958] R10: 0000000000000000 R11: 0000000000000246 R12: 0000563f558ca000
[   36.877084] R13: 0000000000000800 R14: 0000000000000200 R15: 0000563f558c8200
[   36.878222] Mem-Info:
[   36.878600] active_anon:28 inactive_anon:3652 isolated_anon:0
[   36.878600]  active_file:3 inactive_file:0 isolated_file:0
[   36.878600]  unevictable:281131 dirty:1 writeback:1
[   36.878600]  slab_reclaimable:8438 slab_unreclaimable:3104
[   36.878600]  mapped:2779 shmem:2099 pagetables:171 bounce:0
[   36.878600]  free:20396 free_pcp:0 free_cma:0
[   36.883488] Node 0 active_anon:112kB inactive_anon:14608kB active_file:12kB inactive_file:0kB unevictable:1124524kB isolated(anon):0kB isolated(file):0kB mapped:11116kB dirty:4kB writeback:4kB shmem:8396kB shmem_thp: 0kB shmem_pmdmapped: 0kB anon_thp: 0kB writeback_tmp:0kB kernel_stack:1248kB pagetables:684kB all_unreclaimable? yes
[   36.888064] Node 0 DMA free:11060kB min:376kB low:468kB high:560kB active_anon:0kB inactive_anon:0kB active_file:0kB inactive_file:0kB unevictable:0kB writepending:0kB present:15992kB managed:15360kB mlocked:0kB bounce:0kB free_pcp:0kB local_pcp:0kB free_cma:0kB
[   36.891678] lowmem_reserve[]: 0 1809 2671 2671 2671
[   36.892484] Node 0 DMA32 free:48936kB min:45508kB low:56884kB high:68260kB active_anon:112kB inactive_anon:14608kB active_file:12kB inactive_file:0kB unevictable:304596kB writepending:8kB present:2543716kB managed:2433488kB mlocked:0kB bounce:0kB free_pcp:0kB local_pcp:0kB free_cma:0kB
[   36.896411] lowmem_reserve[]: 0 0 862 862 862
[   36.897180] Node 0 Normal free:21588kB min:21692kB low:27112kB high:32532kB active_anon:0kB inactive_anon:0kB active_file:0kB inactive_file:0kB unevictable:819928kB writepending:0kB present:944128kB managed:883200kB mlocked:0kB bounce:0kB free_pcp:0kB local_pcp:0kB free_cma:0kB
[   36.901082] lowmem_reserve[]: 0 0 0 0 0
[   36.901708] Node 0 DMA: 5*4kB (UE) 2*8kB (UE) 1*16kB (E) 2*32kB (UE) 1*64kB (E) 1*128kB (E) 2*256kB (UE) 2*512kB (UE) 1*1024kB (E) 2*2048kB (ME) 1*4096kB (M) = 11060kB
[   36.904108] Node 0 DMA32: 291*4kB (UME) 506*8kB (UME) 342*16kB (UME) 231*32kB (UME) 98*64kB (UME) 36*128kB (UME) 12*256kB (ME) 16*512kB (ME) 5*1024kB (M) 2*2048kB (UM) 0*4096kB = 49436kB
[   36.906759] Node 0 Normal: 2*4kB (UE) 4*8kB (UME) 2*16kB (UM) 3*32kB (UME) 3*64kB (UME) 2*128kB (ME) 2*256kB (ME) 2*512kB (ME) 1*1024kB (E) 1*2048kB (E) 4*4096kB (UM) = 21608kB
[   36.909273] 283232 total pagecache pages
[   36.909904] 0 pages in swap cache
[   36.910472] Swap cache stats: add 0, delete 0, find 0/0
[   36.911318] Free swap  = 0kB
[   36.911789] Total swap = 0kB
[   36.912290] 875959 pages RAM
[   36.912760] 0 pages HighMem/MovableOnly
[   36.913389] 42947 pages reserved
[   36.913912] 0 pages hwpoisoned
[   36.914465] Out of memory: Killed process 197 (system-setup) total-vm:12872kB, anon-rss:1180kB, file-rss:2552kB, shmem-rss:0kB, UID:0 pgtables:76kB oom_score_adj:0
[�[0;1;31mFAILED�[0m] Failed to start Initialize Rescue System.
See 'systemctl status sysinit.service' for details.
[�[0;32m  OK  �[0m] Started Getty on tty0.
[�[0;32m  OK  �[0m] Reached target Login Prompts.
[�[0;32m  OK  �[0m] Started Serial Getty on ttyS0.
[�[0;32m  OK  �[0m] Reached target Multi-User.

pcahyna · 2023-12-14T15:43:16Z

@jsmeix ok, I agree that it makes sense to set console log level both in system setup and rear recover.

jsmeix · 2023-12-15T09:13:30Z

I will implement setting the console log level
for /sbin/rear in a separated pull request.

jsmeix · 2023-12-15T09:26:08Z

@pcahyna
out of curiosity:
How big are such test backups ?
How much disk space need such test systems (df -h) ?

For comparison:
I installed a rather minimal SLES15-SP5 system
with all in a single ext4 root '/' partition
and got

# df -h
Filesystem      Size  Used Avail Use% Mounted on
...
/dev/vda2       8.3G  2.2G  5.7G  28% /

...  74M Dec 15 11:17 rear-localhost.iso
... 868M Dec 15 11:20 backup.tar.gz

so ISO plus backup below 1GB
in particular with this in local.conf (excerpt):

FIRMWARE_FILES=( 'no' )
MODULES=( 'loaded_modules' )

In [skel/default]/etc/scripts/boot set dmesg -n 5 to limit console logging for 'dmesg' messages to level 5 so that kernel error and warning messages appear (intermixed with ReaR messages) on the console so that the user can notice when things go wrong in kernel area which helps to understand problems, see the related issue #3107

pcahyna · 2023-12-18T11:38:11Z

Last time the test succeeded seems to be here https://github.com/rear/rear/runs/19517989303 . I suspect it may be some kernel change, the log at that time shows 4.18.0-526.el8.x86_64.

pcahyna · 2023-12-18T11:45:05Z

@pcahyna out of curiosity: How big are such test backups ? How much disk space need such test systems (df -h) ?

For comparison: I installed a rather minimal SLES15-SP5 system with all in a single ext4 root '/' partition and got
# df -h
Filesystem      Size  Used Avail Use% Mounted on
...
/dev/vda2       8.3G  2.2G  5.7G  28% /
...  74M Dec 15 11:17 rear-localhost.iso
... 868M Dec 15 11:20 backup.tar.gz
so ISO plus backup below 1GB.

@jsmeix good question!

Failed run has this:

Running 'pack' stage ======================
Creating recovery/rescue system initramfs/initrd initrd.cgz with gzip default compression
Created initrd.cgz with gzip default compression (585960054 bytes) in 36 seconds
Running 'backup' stage ======================
Making backup (using backup method NETFS)
Creating tar archive '/var/tmp/rear.TLe4h8iNwow9QvX/tmp/isofs/backup/backup.tar.gz'
Preparing archive operationOK
Archived 1542 MiB in 149 seconds [avg 10597 KiB/sec]
Running 'output' stage ======================
Searching whole /usr for SYSLINUX modules directory (you may specify SYSLINUX_MODULES_DIR)
Making ISO image
Wrote ISO image: /var/lib/rear/output/rear-ip-172-31-28-157.iso (2.1G)

https://artifacts.dev.testing-farm.io/f6bd055b-71dd-4cb7-9550-89007881c9fa/work-backup-and-restoretizyhdqu/tests/plans/backup-and-restore/execute/data/guest/default-0/make-backup-and-restore-iso-1/output.txt - hidden in a very deep directory unfortunately.

Successful run had this:

Running 'pack' stage ======================
Creating recovery/rescue system initramfs/initrd initrd.cgz with gzip default compression
Created initrd.cgz with gzip default compression (117249126 bytes) in 7 seconds
Running 'backup' stage ======================
Making backup (using backup method NETFS)
Creating tar archive '/var/tmp/rear.6NxWWL8y4WQkIn9/tmp/isofs/backup/backup.tar.gz'
Preparing archive operationOK
Archived 816 MiB in 108 seconds [avg 7741 KiB/sec]
Running 'output' stage ======================
Searching whole /usr for SYSLINUX modules directory (you may specify SYSLINUX_MODULES_DIR)
Making ISO image
Wrote ISO image: /var/lib/rear/output/rear-ip-172-31-17-145.iso (963M)

https://artifacts.dev.testing-farm.io/df70fe97-ba51-466e-8472-ba72bead8bd9/
so, a huge change both in the initrd size and the whole backup size.

jsmeix · 2023-12-18T12:35:20Z

@pcahyna
to find big files and directories you may run
inside those test systems a command like

# du -amt123M / 2>/dev/null

e.g. here to show all files and directories
that need more than 123 MB disk space.

For example on my above rather minimal SLES15-SP5 test VM:

# du -amt70M / 2>/dev/null
108     /usr/lib64/libLLVM.so.15
390     /usr/lib64
95      /usr/bin
227     /usr/lib/locale
382     /usr/lib
101     /usr/share/locale
304     /usr/share
1196    /usr
72      /lib/firmware/mrvl/prestera
77      /lib/firmware/mrvl
359     /lib/firmware
78      /lib/modules/5.14.21-150500.53-default/kernel
82      /lib/modules/5.14.21-150500.53-default
82      /lib/modules
440     /lib
74      /root/rear/var/lib/rear/output/rear-localhost.iso
74      /root/rear/var/lib/rear/output
74      /root/rear/var/lib/rear
74      /root/rear/var/lib
77      /root/rear/var
99      /root/rear
99      /root
1846    /

jsmeix · 2023-12-18T12:39:42Z

Regarding "suspect it may be some kernel change":

In this case I think "the usual suspects"
are new and big firmware files or kernel modules
which may make the backup big.

To avoid a needless big ReaR recovery system initrd
(in particular on a VM where firmware is not needed)

FIRMWARE_FILES=( 'no' )
MODULES=( 'loaded_modules' )

is basically mandatory.

pcahyna · 2023-12-18T18:13:58Z

@jsmeix indeed that was it, thank you for the tips:

# du -amt70M  /tmp/rear.u4cUxdVyXNiz1qd/rootfs/ 2>/dev/null 
82      /tmp/rear.u4cUxdVyXNiz1qd/rootfs/usr/lib/firmware/mellanox
78      /tmp/rear.u4cUxdVyXNiz1qd/rootfs/usr/lib/firmware/amdgpu
72      /tmp/rear.u4cUxdVyXNiz1qd/rootfs/usr/lib/firmware/mrvl/prestera
81      /tmp/rear.u4cUxdVyXNiz1qd/rootfs/usr/lib/firmware/mrvl
167     /tmp/rear.u4cUxdVyXNiz1qd/rootfs/usr/lib/firmware/qcom
141     /tmp/rear.u4cUxdVyXNiz1qd/rootfs/usr/lib/firmware/netronome
1065    /tmp/rear.u4cUxdVyXNiz1qd/rootfs/usr/lib/firmware
1157    /tmp/rear.u4cUxdVyXNiz1qd/rootfs/usr/lib
1221    /tmp/rear.u4cUxdVyXNiz1qd/rootfs/usr
1278    /tmp/rear.u4cUxdVyXNiz1qd/rootfs/

The size of the firmware increased a bit, but that should not be the reason. It seems that the test started to fail and used to pass before because before there was no linux-firmware package installed on the test machines and now there is. And the centos-stream-9 test passes because the package is still absent in the testing image. I made an experiment in pcahyna#14 and indeed:
centos-stream-8 :

::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::
::   Is linux-firmware installed?
::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::

linux-firmware-20231121-120.git9552083a.el8.noarch
:: [ 15:36:58 ] :: [   PASS   ] :: Checking for the presence of linux-firmware rpm 
:: [ 15:36:58 ] :: [   LOG    ] :: Package versions:
:: [ 15:36:58 ] :: [   LOG    ] ::   linux-firmware-20231121-120.git9552083a.el8.noarch
::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::
::   Duration: 0s
::   Assertions: 1 good, 0 bad
::   RESULT: PASS

::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::
::   Find huge files in ReaR rootfs
::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::

:: [ 15:41:30 ] :: [  BEGIN   ] :: Running 'du -amt70M  /var/tmp/rear.*/rootfs/ 2>/dev/null'
82	/var/tmp/rear.xfXZ7nnhbYX0QU3/rootfs/usr/lib/firmware/mellanox
78	/var/tmp/rear.xfXZ7nnhbYX0QU3/rootfs/usr/lib/firmware/amdgpu
72	/var/tmp/rear.xfXZ7nnhbYX0QU3/rootfs/usr/lib/firmware/mrvl/prestera
81	/var/tmp/rear.xfXZ7nnhbYX0QU3/rootfs/usr/lib/firmware/mrvl
167	/var/tmp/rear.xfXZ7nnhbYX0QU3/rootfs/usr/lib/firmware/qcom
141	/var/tmp/rear.xfXZ7nnhbYX0QU3/rootfs/usr/lib/firmware/netronome
891	/var/tmp/rear.xfXZ7nnhbYX0QU3/rootfs/usr/lib/firmware
983	/var/tmp/rear.xfXZ7nnhbYX0QU3/rootfs/usr/lib
1045	/var/tmp/rear.xfXZ7nnhbYX0QU3/rootfs/usr
1094	/var/tmp/rear.xfXZ7nnhbYX0QU3/rootfs/
:: [ 15:41:30 ] :: [   PASS   ] :: Command 'du -amt70M  /var/tmp/rear.*/rootfs/ 2>/dev/null' (Expected 0, got 0)

centos-stream-9 :

:::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::
::   Is linux-firmware installed?
::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::

package linux-firmware is not installed
:: [ 15:31:26 ] :: [   FAIL   ] :: Checking for the presence of linux-firmware rpm 
::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::
::   Duration: 0s
::   Assertions: 0 good, 1 bad
::   RESULT: WARN

::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::
::   Find huge files in ReaR rootfs
::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::

:: [ 15:34:29 ] :: [  BEGIN   ] :: Running 'du -amt70M  /var/tmp/rear.*/rootfs/ 2>/dev/null'
87	/var/tmp/rear.Gh4BPLBzHp7pzvc/rootfs/usr/lib/modules/5.14.0-391.el9.x86_64
87	/var/tmp/rear.Gh4BPLBzHp7pzvc/rootfs/usr/lib/modules
107	/var/tmp/rear.Gh4BPLBzHp7pzvc/rootfs/usr/lib
171	/var/tmp/rear.Gh4BPLBzHp7pzvc/rootfs/usr
223	/var/tmp/rear.Gh4BPLBzHp7pzvc/rootfs/
:: [ 15:34:29 ] :: [   PASS   ] :: Command 'du -amt70M  /var/tmp/rear.*/rootfs/ 2>/dev/null' (Expected 0, got 0)

I will add FIRMWARE_FILES=( 'no' ) to the configuration in the CI test.

jsmeix · 2023-12-19T06:38:49Z

@pcahyna
because in your
#3107 (comment)
you wrote there is
"a huge change both in the initrd size and the whole backup size"
I think that FIRMWARE_FILES=( 'no' ) alone
does not fully solve it because this affects
only the size of the ReaR recovery system initrd
but it does not affect the size of the backup.

So I think that additionally you need to
exclude the firmware files from the backup.

Alternative it should work to not install (or uninstall)
the linux-firmware package because then there should be
no firmware on the original system so that firmware
can neither be in the ReaR recovery system initrd
nor in the backup.

In sbin/rear find out if we're running inside the recovery system and if yes call 'dmesg -n [4-7]' depending on verbose and debug modes for ReaR, see #3107 and #3108

pcahyna · 2023-12-19T12:47:57Z

@pcahyna because in your #3107 (comment) you wrote there is "a huge change both in the initrd size and the whole backup size" I think that FIRMWARE_FILES=( 'no' ) alone does not fully solve it because this affects only the size of the ReaR recovery system initrd but it does not affect the size of the backup.

So I think that additionally you need to exclude the firmware files from the backup.

Alternative it should work to not install (or uninstall) the linux-firmware package because then there should be no firmware on the original system so that firmware can neither be in the ReaR recovery system initrd nor in the backup.

FIRMWARE_FILES=( no ) is enough. Successful run: https://github.com/pcahyna/rear/pull/14/checks?check_run_id=19784045747 .
Without it, the firmware files will appear 2x in the ISO: once in the bootable initrd and second time in the backup. I also suspect that they take up space in RAM because of the booted ISO image (used by memdisk to emulate the CD-ROM for the bootloader - not sure whether this kind of bootloader memory gets freed by the kernel after boot) and again in the booted initramfs (running recovery system).
So, FIRMWARE_FILES=( no ) saves a lot of space even if the files remain in the backup inside the ISO.

pcahyna · 2023-12-19T16:14:53Z

@jsmeix thanks a lot for the helpful suggestions, the test now passes as you can see in your latest PR (I retriggered it manually).

pcahyna added the ReaR Project The ReaR project itself and the tooling we use to work on it. label Dec 13, 2023

pcahyna self-assigned this Dec 13, 2023

jsmeix mentioned this issue Dec 14, 2023

Set 'dmesg -n 5' in etc/scripts/boot #3108

Merged

jsmeix added a commit that referenced this issue Dec 19, 2023

Set 'dmesg -n [4-7]' in sbin/rear

d37d517

In sbin/rear find out if we're running inside the recovery system and if yes call 'dmesg -n [4-7]' depending on verbose and debug modes for ReaR, see #3107 and #3108

jsmeix mentioned this issue Dec 19, 2023

Set 'dmesg -n [4-7]' in sbin/rear #3112

Closed

pcahyna mentioned this issue Dec 19, 2023

backup/recovery test: exclude firmware files to reduce initrd size rear/rear-integration-tests#4

Merged

pcahyna closed this as completed in rear/rear-integration-tests#4 Dec 19, 2023

jsmeix added the fixed / solved / done label Dec 20, 2023

jsmeix mentioned this issue Dec 22, 2023

Set 'dmesg -n [4-6]' in new 'setup' script 007_set_dmesg_level.sh #3113

Merged

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Debug intermittent backup/recovery CI errors #3107

Debug intermittent backup/recovery CI errors #3107

pcahyna commented Dec 13, 2023

jsmeix commented Dec 14, 2023

pcahyna commented Dec 14, 2023

pcahyna commented Dec 14, 2023

jsmeix commented Dec 14, 2023

pcahyna commented Dec 14, 2023

jsmeix commented Dec 14, 2023 •

edited

jsmeix commented Dec 14, 2023 •

edited

pcahyna commented Dec 14, 2023

pcahyna commented Dec 14, 2023

jsmeix commented Dec 14, 2023

jsmeix commented Dec 14, 2023 •

edited

jsmeix commented Dec 14, 2023 •

edited

jsmeix commented Dec 14, 2023

pcahyna commented Dec 14, 2023

pcahyna commented Dec 14, 2023

pcahyna commented Dec 14, 2023

jsmeix commented Dec 15, 2023

jsmeix commented Dec 15, 2023 •

edited

pcahyna commented Dec 18, 2023

pcahyna commented Dec 18, 2023

jsmeix commented Dec 18, 2023

jsmeix commented Dec 18, 2023

pcahyna commented Dec 18, 2023

jsmeix commented Dec 19, 2023 •

edited

pcahyna commented Dec 19, 2023

pcahyna commented Dec 19, 2023

Debug intermittent backup/recovery CI errors #3107

Debug intermittent backup/recovery CI errors #3107

Comments

pcahyna commented Dec 13, 2023

jsmeix commented Dec 14, 2023

pcahyna commented Dec 14, 2023

pcahyna commented Dec 14, 2023

jsmeix commented Dec 14, 2023

pcahyna commented Dec 14, 2023

jsmeix commented Dec 14, 2023 • edited

jsmeix commented Dec 14, 2023 • edited

pcahyna commented Dec 14, 2023

pcahyna commented Dec 14, 2023

jsmeix commented Dec 14, 2023

jsmeix commented Dec 14, 2023 • edited

jsmeix commented Dec 14, 2023 • edited

jsmeix commented Dec 14, 2023

pcahyna commented Dec 14, 2023

pcahyna commented Dec 14, 2023

pcahyna commented Dec 14, 2023

jsmeix commented Dec 15, 2023

jsmeix commented Dec 15, 2023 • edited

pcahyna commented Dec 18, 2023

pcahyna commented Dec 18, 2023

jsmeix commented Dec 18, 2023

jsmeix commented Dec 18, 2023

pcahyna commented Dec 18, 2023

jsmeix commented Dec 19, 2023 • edited

pcahyna commented Dec 19, 2023

pcahyna commented Dec 19, 2023

jsmeix commented Dec 14, 2023 •

edited

jsmeix commented Dec 14, 2023 •

edited

jsmeix commented Dec 14, 2023 •

edited

jsmeix commented Dec 14, 2023 •

edited

jsmeix commented Dec 15, 2023 •

edited

jsmeix commented Dec 19, 2023 •

edited