Support restart of test when it crashes #2696

happz · 2024-02-21T15:33:23Z

As discussed today, there's a use case for restarting a test when it crashes:

09:26:12                 out: :: [ 14:26:12 ] :: [   PASS   ] :: Command 'make all' (Expected 0, got 0)
09:26:12                 out: :: [ 14:26:12 ] :: [  BEGIN   ] :: Running 'echo 1 > /sys/kernel/vkm/write_um_crash'
09:26:12                 out: ./tmt-test-wrapper.sh.default-0: line 1:  6543 Segmentation fault      bash ./write_um.sh
09:26:12                 out: Shared connection to 10.26.28.203 closed.
09:26:12         Command returned '139'.

In this case, the user would like to see the test restarted - the test was killed by a kernel oops, and when restarted, it would take care of follow-up steps, like decoding the kernel dump.

After some discussion, the proposal would be:

- a test key to indicate the test shall be restarted when it crashes. It might be a list of exit codes, or tmt might define the list of crash-like exit codes, and this key would be a simple flag.
  - Test restart on crash #2870
- a test key to indicate how many times the test should be restarted. We need to avoid endless loops, and tmt should give up at some point. The default might be zero, or a reasonably low value - the value would not be used unless the first key is enabled anyway.
  - Test restart on crash #2870
- a test key to indicate whether to reboot the guest before restarting the test. In this particular case, there should be no guest reboot, the test needs to re-enter the environment as it is.
  - Test restart on crash #2870
- a new environment variable, similar to TMT_REBOOT_COUNT, but counting test restarts. With reboot disabled, the test might run multiple times while TMT_REBOOT_COUNT remains zero.
  - Add test restart counter, similar to TMT_REBOOT_COUNT #2787
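
The first two keys amount to a small decision rule: restart only when the exit code looks like a crash and the restart budget is not yet spent. A minimal sketch of that rule in shell, not tmt's implementation: the function name `should_restart` and the `CRASH_EXIT_CODES` list are hypothetical. The list contains 139 because that is the code in the log above (128 + 11, a process killed by SIGSEGV).

```shell
# Hypothetical helper illustrating the proposal; tmt does not ship this function.
# Illustrative list of crash-like exit codes: 134 = 128 + SIGABRT(6), 139 = 128 + SIGSEGV(11).
CRASH_EXIT_CODES="134 139"

# should_restart EXIT_CODE RESTARTS_DONE MAX_RESTARTS
# Returns 0 when the test should be restarted, 1 otherwise.
should_restart() {
    local exit_code="$1" restarts_done="$2" max_restarts="$3" code

    # Give up once the restart budget is spent, so a test that keeps
    # crashing cannot loop forever.
    if [ "$restarts_done" -ge "$max_restarts" ]; then
        return 1
    fi

    for code in $CRASH_EXIT_CODES; do
        if [ "$exit_code" -eq "$code" ]; then
            return 0
        fi
    done

    return 1
}
```

With a budget of one restart, `should_restart 139 0 1` succeeds and `should_restart 139 1 1` fails. A default budget of zero disables restarts entirely, which matches the "default might be zero" option.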

sbertramrh · 2024-02-27T20:28:52Z

Hi @happz and @lukaszachy, I found a workaround for my case. With nohup, the crash no longer aborts the test, and the test continues past the error.

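        # Read-only crash test; any exit code is accepted because the write is expected to crash
        rlRun "nohup echo 1 > /sys/kernel/vkm/write_ro_crash" "0-255"
        # Wait until the target answers pings again.
        # ${SOC///*} strips everything from the first '/' onwards.
        until ping -q -c 1 "${SOC///*}"; do
            sleep 5
        done
        # Save the kernel log and check that it recorded the expected fault
        rlRun "dmesg > dmesg-crash.log"
        rlAssertGrep "Unable to handle kernel write to read-only memory" dmesg-crash.log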
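
The ping loop in this workaround waits forever if the machine never comes back. A bounded variant could look like the sketch below. `wait_until` is a hypothetical helper, not part of BeakerLib or tmt, and any timeout passed to it is an arbitrary choice.

```shell
# wait_until TIMEOUT_SECONDS INTERVAL_SECONDS COMMAND [ARGS...]
# Re-runs COMMAND until it succeeds; returns 1 once TIMEOUT_SECONDS have elapsed.
wait_until() {
    local timeout="$1" interval="$2" deadline
    shift 2
    deadline=$(( $(date +%s) + timeout ))

    until "$@"; do
        # Stop retrying when the deadline has passed.
        if [ "$(date +%s)" -ge "$deadline" ]; then
            return 1
        fi
        sleep "$interval"
    done
}
```

In the test above, the loop would then become something like `wait_until 300 5 ping -q -c 1 "${SOC///*}" || rlFail "host did not come back"`, so a machine that stays down fails the test instead of hanging it.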

result:

15:00:13                 out: :: [ 20:00:13 ] :: [  BEGIN   ] :: Running 'nohup echo 1 > /sys/kernel/vkm/write_ro_crash'
15:00:13                 out: /usr/share/beakerlib/testing.sh: line 896:  1467 Segmentation fault      nohup echo 1 > /sys/kernel/vkm/write_ro_crash
15:00:13                 out: :: [ 20:00:13 ] :: [   PASS   ] :: Command 'nohup echo 1 > /sys/kernel/vkm/write_ro_crash' (Expected 0-255, got 139)
15:00:13                 out: PING 10.26.28.203 (10.26.28.203) 56(84) bytes of data.
15:00:13                 out: 
15:00:13                 out: --- 10.26.28.203 ping statistics ---
15:00:13                 out: 1 packets transmitted, 1 received, 0% packet loss, time 0ms
15:00:13                 out: rtt min/avg/max/mdev = 0.046/0.046/0.046/0.000 ms
15:00:13                 out: :: [ 20:00:13 ] :: [  BEGIN   ] :: Running 'dmesg > dmesg-crash.log'
15:00:13                 out: :: [ 20:00:13 ] :: [   PASS   ] :: Command 'dmesg > dmesg-crash.log' (Expected 0, got 0)
15:00:13                 out: :: [ 20:00:13 ] :: [   PASS   ] :: File 'dmesg-crash.log' should contain 'Unable to handle kernel write to read-only memory'

pablmart · 2024-03-11T13:59:30Z

Hello, @happz and @lukaszachy

I wrote a test that forces a stack underflow within a kernel module. This causes a BUG and, with kernel.panic set to 5 seconds via sysctl, a subsequent restart of the machine:

[ 1748.996748] BUG: unable to handle page fault for address: ffffaa90401e8000
[ 1748.996751] #PF: supervisor read access in kernel mode
[ 1748.996752] #PF: error_code(0x0000) - not-present page
[ 1748.996753] PGD 1800067 P4D 1800067 PUD 1a0e067 PMD 1a18067 PTE 0
[ 1748.996759] Oops: 0000 [#1] PREEMPT_RT SMP NOPTI
[ 1748.996762] CPU: 3 PID: 50 Comm: ksoftirqd/3 Tainted: G OE X ------- --- 5.14.0-427.380.el9iv.x86_64 #1
[ 1748.996765] Hardware name: QEMU Standard PC (Q35 + ICH9, 2009), BIOS edk2-20230524-3.fc38 05/24/2023
[ 1748.996766] RIP: 0010:tasklet_fn+0x66/0x78 [stackman]
[ 1748.996770] Code: 75 02 eb fe 58 ff c8 75 fb eb 1a 48 c7 44 24 10 79 56 34 12 e8 a7 fe ff ff 48 c7 c7 f8 10 86 c0 e8 8e fe 75 f2 b8 00 00 01 00 <58> ff c8 75 fb 48 c7 c7 b6 10 86 c0 5b e9 77 fe 75 f2 90 90 90 90

I tried with 'rstrnt-prepare-reboot' before loading the module that causes the crash, but tmt disconnects, tries to rsync and times out.

I think this test and two others that exercise memory violation handling within the kernel are cases in favor of implementing this feature.

happz · 2024-03-25T14:16:11Z

@pablmart hello, could you share the test? I'd like to use it as a reproducer when working on the feature.

To accommodate tests that can be restarted without a reboot, e.g. after a test crash or kernel panic, tmt would keep a test restart counter. It is similar to the well-established `TMT_REBOOT_COUNT`, but tracks test restarts. For most of the tests, both counters would have the same value. Related to #2696
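
A test that is re-entered after a crash could use such a counter to pick up where it left off. A sketch, assuming the counter is exported as `TMT_TEST_RESTART_COUNT` (the thread does not spell out the name) and using a hypothetical `phase_for_run` helper with made-up phase names:

```shell
# Assumption: the restart counter is exported as TMT_TEST_RESTART_COUNT.
# Default to 0 so the script also works when the variable is not set.
restart_count="${TMT_TEST_RESTART_COUNT:-0}"

# phase_for_run RESTART_COUNT
# Hypothetical helper: the first run triggers the crash, later runs do the follow-up.
phase_for_run() {
    if [ "$1" -eq 0 ]; then
        echo "trigger-crash"
    else
        echo "collect-dump"
    fi
}

phase="$(phase_for_run "$restart_count")"
echo "restart ${restart_count}: ${phase}"
```

Because the restart happens without a reboot, `TMT_REBOOT_COUNT` would stay at zero across these runs, which is why a separate counter is needed.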

pablmart · 2024-03-25T16:56:57Z

> @pablmart hello, could you share the test? I'd like to use it as a reproducer when working on the feature.

Yes, the test is in the same repo linked in the comment above that mentions 'rstrnt-prepare-reboot':

kernel-stack-overflow-udnerflow-scribbling

weiwang-linda · 2024-04-10T03:48:24Z

I encountered a similar problem when testing the ftrace= kernel parameter with tmt run.

Tested manually with auto-osbuild-qemu-rhivos9-qa-ostree-aarch64-7874633.e1769674.qcow2.xz.

The available tracers are:

        $ cat /sys/kernel/debug/tracing/available_tracers
        timerlat osnoise hwlat blk function_graph wakeup_dl wakeup_rt wakeup function nop

Install a VM with the above image, then run:

        export CMDLINEARGS="ftrace=timerlat"
        rpm-ostree kargs --append-if-missing="${CMDLINEARGS##-}" --import-proc-cmdline
        systemctl reboot

After the reboot, the host can no longer be reached over SSH. Only "timerlat" and "osnoise" make the host panic.

happz · 2024-04-17T15:10:02Z

Kicking off the implementation of the actual test restart in #2870. It does have some rough edges, although there is a test that passes.

I plan to run it with the kernel-stack-overflow-udnerflow-scribbling test provided by @pablmart, feel free to experiment too.

One piece we need to address ASAP - naming. I picked some names for new keys, but they are ugly and I don't like them. I can change them easily, but I'm out of ideas - feel free to propose changes here as well, besides the actual bugs and issues :)

happz · 2024-04-29T12:36:50Z

A similar case: what if the test does not crash, but triggers a reboot, e.g. through an Ansible role that cannot use tmt-reboot? This would manifest as a broken SSH session:

                out: TASK [sap_general_preconfigure : Flush handlers] *******************************
                out: 
                out: RUNNING HANDLER [sap_general_preconfigure : Reboot the managed node] ***********
                out: Shared connection to restqe01 closed.
            cmd: rsync --version
            err: ssh: connect to host restqe01 port 22: Connection refused
            cmd: dnf --version
            err: ssh: connect to host restqe01 port 22: Connection refused
            cmd: rpm-ostree --version
            err: ssh: connect to host restqe01 port 22: Connection refused
            cmd: yum install -y rsync
            err: ssh: connect to host restqe01 port 22: Connection refused
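
The two situations in this thread look different from the runner's side: the crashed test ended with exit code 139, while a reboot behind tmt's back tears down the SSH session, and the `ssh` client reports its own failures with exit code 255. A rough sketch of telling them apart, not taken from tmt: `classify_exit` and its labels are hypothetical, and a test that itself exits with 255 would be misclassified.

```shell
# classify_exit EXIT_CODE
# Hypothetical helper: prints a coarse label for the exit code of a test run over ssh.
classify_exit() {
    local exit_code="$1"

    if [ "$exit_code" -eq 255 ]; then
        # ssh itself failed, e.g. the guest rebooted and the session was dropped.
        echo "connection-lost"
    elif [ "$exit_code" -gt 128 ]; then
        # Shell convention: 128 + N means the process was killed by signal N.
        echo "killed-by-signal-$(( exit_code - 128 ))"
    else
        echo "exited"
    fi
}
```

For the first log in this issue, `classify_exit 139` prints `killed-by-signal-11`, i.e. SIGSEGV.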

pablmart · 2024-05-03T16:36:10Z

The MR 2870 solves the issue with the kernel-stack-overflow-underflow-scribbling test. Many thanks!

happz added execute Execute step tests labels Feb 21, 2024

lukaszachy added the specification label Feb 21, 2024

happz mentioned this issue Mar 25, 2024: Add test restart counter, similar to TMT_REBOOT_COUNT #2787 (Merged)

happz mentioned this issue Apr 25, 2024: Improve robustness of SSH connection to SUT #2890 (Open)
