Support restart of test when it crashes #2696

Open
happz opened this issue Feb 21, 2024 · 8 comments

@happz
Collaborator

happz commented Feb 21, 2024

As discussed today, there's a use case for restarting a test when it crashes:

09:26:12                 out: :: [ 14:26:12 ] :: [   PASS   ] :: Command 'make all' (Expected 0, got 0)
09:26:12                 out: :: [ 14:26:12 ] :: [  BEGIN   ] :: Running 'echo 1 > /sys/kernel/vkm/write_um_crash'
09:26:12                 out: ./tmt-test-wrapper.sh.default-0: line 1:  6543 Segmentation fault      bash ./write_um.sh
09:26:12                 out: Shared connection to 10.26.28.203 closed.
09:26:12         Command returned '139'.

In this case, the user would like to see the test restarted - the test was killed by a kernel oops, and when restarted, it would take care of follow-up steps, like decoding the kernel dump.

After some discussion, the proposal would be:

  • a test key to indicate the test shall be restarted when it crashes. Might be a list of exit codes, or tmt might define the list of crash-like exit codes, and this key would be a simple flag.
  • a test key to indicate how many times the test should be restarted. We need to avoid endless loops, and tmt should give up at some point. The default might be a zero, or a reasonably low value - the value would not be used unless the first key is enabled anyway.
  • a test key to indicate whether to reboot the guest before restarting the test. In this particular case, there should be no guest reboot, the test needs to re-enter the environment as it is.
  • a new environment variable, similar to TMT_REBOOT_COUNT, but counting test restarts. With reboot disabled, the test might run multiple times while TMT_REBOOT_COUNT remains zero.
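
The restart decision outlined in the proposal above could be sketched roughly as follows. This is only an illustration of the idea (a list of crash-like exit codes plus an upper bound on restarts), not tmt's implementation: the function name `should_restart` and its argument order are hypothetical, and the thread does not define the key names.

```shell
#!/bin/bash
# Sketch only: should_restart and its arguments are hypothetical names,
# not tmt's actual implementation or test keys.

# Usage: should_restart EXIT_CODE RESTARTS_SO_FAR MAX_RESTARTS CODE...
# Returns 0 (restart) when EXIT_CODE is in the crash-like list and the
# restart budget is not used up yet; returns 1 (give up) otherwise.
should_restart() {
    local exit_code="$1" restarts="$2" max_restarts="$3"
    shift 3

    # Give up once the limit is reached, which prevents endless loops.
    if [ "$restarts" -ge "$max_restarts" ]; then
        return 1
    fi

    # Restart only for exit codes listed as crash-like.
    local code
    for code in "$@"; do
        if [ "$exit_code" -eq "$code" ]; then
            return 0
        fi
    done
    return 1
}

# Example: 139 (128 + SIGSEGV) counts as crash-like, at most 2 restarts.
if should_restart 139 0 2 139 255; then
    echo "restart"
else
    echo "give up"
fi
```

With a maximum of zero restarts the function always gives up, which matches the suggestion that the limit has no effect unless restarting is enabled in the first place.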
@sbertramrh

Hi @happz and @lukaszachy, I found a workaround for my case. With nohup, the crash no longer aborts the test, and the test continues past the error.

        # Read only crash test
        rlRun "nohup echo 1 > /sys/kernel/vkm/write_ro_crash" "0-255"
        while (! ping -q -c 1 ${SOC///*}); do
            sleep 5
        done
        rlRun "dmesg > dmesg-crash.log"
        rlAssertGrep "Unable to handle kernel write to read-only memory" dmesg-crash.log

result:

15:00:13                 out: :: [ 20:00:13 ] :: [  BEGIN   ] :: Running 'nohup echo 1 > /sys/kernel/vkm/write_ro_crash'
15:00:13                 out: /usr/share/beakerlib/testing.sh: line 896:  1467 Segmentation fault      nohup echo 1 > /sys/kernel/vkm/write_ro_crash
15:00:13                 out: :: [ 20:00:13 ] :: [   PASS   ] :: Command 'nohup echo 1 > /sys/kernel/vkm/write_ro_crash' (Expected 0-255, got 139)
15:00:13                 out: PING 10.26.28.203 (10.26.28.203) 56(84) bytes of data.
15:00:13                 out: 
15:00:13                 out: --- 10.26.28.203 ping statistics ---
15:00:13                 out: 1 packets transmitted, 1 received, 0% packet loss, time 0ms
15:00:13                 out: rtt min/avg/max/mdev = 0.046/0.046/0.046/0.000 ms
15:00:13                 out: :: [ 20:00:13 ] :: [  BEGIN   ] :: Running 'dmesg > dmesg-crash.log'
15:00:13                 out: :: [ 20:00:13 ] :: [   PASS   ] :: Command 'dmesg > dmesg-crash.log' (Expected 0, got 0)
15:00:13                 out: :: [ 20:00:13 ] :: [   PASS   ] :: File 'dmesg-crash.log' should contain 'Unable to handle kernel write to read-only memory' 
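
The ping loop in the workaround above waits indefinitely if the host never comes back. A bounded variant might look like the sketch below; `wait_until` and its parameters are hypothetical helper names, not part of beakerlib or tmt.

```shell
#!/bin/bash
# Sketch only: wait_until is a hypothetical helper, not a beakerlib or tmt function.

# Usage: wait_until MAX_ATTEMPTS DELAY_SECONDS COMMAND [ARG...]
# Re-runs COMMAND until it succeeds or MAX_ATTEMPTS attempts were made.
wait_until() {
    local max_attempts="$1" delay="$2"
    shift 2

    local attempt
    for (( attempt = 1; attempt <= max_attempts; attempt++ )); do
        if "$@"; then
            return 0
        fi
        # Sleep between attempts, but not after the last one.
        if (( attempt < max_attempts )); then
            sleep "$delay"
        fi
    done
    return 1
}

# Trivial demonstration with a command that succeeds immediately.
if wait_until 3 0 true; then
    echo "reachable"
fi
```

In the test above, the loop could then be replaced by something like `wait_until 60 5 ping -q -c 1 "${SOC///*}"`, with the attempt count and delay chosen to fit the expected recovery time.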

@pablmart

Hello, @happz and @lukaszachy

I wrote a test that forces a stack underflow within a kernel module. This causes a BUG and, with kernel.panic set to 5 seconds via sysctl, a subsequent restart:

[ 1748.996748] BUG: unable to handle page fault for address: ffffaa90401e8000
[ 1748.996751] #PF: supervisor read access in kernel mode
[ 1748.996752] #PF: error_code(0x0000) - not-present page
[ 1748.996753] PGD 1800067 P4D 1800067 PUD 1a0e067 PMD 1a18067 PTE 0
[ 1748.996759] Oops: 0000 [#1] PREEMPT_RT SMP NOPTI
[ 1748.996762] CPU: 3 PID: 50 Comm: ksoftirqd/3 Tainted: G OE X ------- --- 5.14.0-427.380.el9iv.x86_64 #1
[ 1748.996765] Hardware name: QEMU Standard PC (Q35 + ICH9, 2009), BIOS edk2-20230524-3.fc38 05/24/2023
[ 1748.996766] RIP: 0010:tasklet_fn+0x66/0x78 [stackman]
[ 1748.996770] Code: 75 02 eb fe 58 ff c8 75 fb eb 1a 48 c7 44 24 10 79 56 34 12 e8 a7 fe ff ff 48 c7 c7 f8 10 86 c0 e8 8e fe 75 f2 b8 00 00 01 00 <58> ff c8 75 fb 48 c7 c7 b6 10 86 c0 5b e9 77 fe 75 f2 90 90 90 90

I tried with 'rstrnt-prepare-reboot' before loading the module that causes the crash, but tmt disconnects, tries to rsync and times out.

I think this test and two others that exercise memory violation handling within the kernel are cases in favor of implementing this feature.

@happz
Collaborator Author

happz commented Mar 25, 2024

@pablmart hello, could you share the test? I'd like to use it as a reproducer when working on the feature.

happz added a commit that referenced this issue Mar 25, 2024
To accommodate tests that can be restarted without a reboot, e.g. after a
test crash or kernel panic, tmt would keep a test restart counter. It is
similar to the well-established `TMT_REBOOT_COUNT`, but tracks test
restarts. For most of the tests, both counters would have the same
value.

Related to #2696
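
From the test's side, the restart counter described in the commit message above could be used to tell the first run from a restarted one. The sketch below assumes the counter is exposed as `TMT_TEST_RESTART_COUNT`; that variable name is an assumption, since the thread does not name it, and `test_phase` is a hypothetical helper.

```shell
#!/bin/bash
# Sketch only: TMT_TEST_RESTART_COUNT is an assumed variable name and
# test_phase is a hypothetical helper; neither is confirmed by this thread.

# Usage: test_phase RESTART_COUNT
# Prints which part of the test should run for the given restart count.
test_phase() {
    local restart_count="${1:-0}"
    if [ "$restart_count" -eq 0 ]; then
        echo "trigger-crash"
    else
        echo "collect-results"
    fi
}

# Treat a missing variable as the first run.
case "$(test_phase "${TMT_TEST_RESTART_COUNT:-0}")" in
    trigger-crash)   echo "first run: would trigger the crash here" ;;
    collect-results) echo "restarted: would decode the kernel dump here" ;;
esac
```

Branching on the restart counter rather than on TMT_REBOOT_COUNT matters for the no-reboot case, where the test runs again while the reboot counter stays at zero.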
@pablmart

> @pablmart hello, could you share the test? I'd like to use it as a reproducer when working on the feature.

Yes, the test is in the same repo linked in the above comment mentioning 'rstrnt-prepare-reboot':

kernel-stack-overflow-udnerflow-scribbling

happz added a commit that referenced this issue Mar 27, 2024
@weiwang-linda

weiwang-linda commented Apr 10, 2024

I encountered a similar problem when testing the ftrace= kernel parameter with tmt run.

Tested manually with the auto-osbuild-qemu-rhivos9-qa-ostree-aarch64-7874633.e1769674.qcow2.xz image.

The available tracers are:
$cat /sys/kernel/debug/tracing/available_tracers
timerlat osnoise hwlat blk function_graph wakeup_dl wakeup_rt wakeup function nop

  1. Install a vm with above image
  2. export CMDLINEARGS="ftrace=timerlat"
  3. rpm-ostree kargs --append-if-missing="${CMDLINEARGS##-}" --import-proc-cmdline
  4. systemctl reboot
    After that, the host can no longer be reached over SSH. Only the "timerlat" and "osnoise" tracers make the host panic.

happz added a commit that referenced this issue Apr 17, 2024
@happz
Collaborator Author

happz commented Apr 17, 2024

Kicking off the implementation of the actual test restart in #2870. It does have some rough edges, although there is a test that passes.

I plan to run it with the kernel-stack-overflow-udnerflow-scribbling test provided by @pablmart, feel free to experiment too.

One piece we need to address ASAP - naming. I picked some names for new keys, but they are ugly and I don't like them. I can change them easily, but I'm out of ideas - feel free to propose changes here as well, besides the actual bugs and issues :)

@happz
Collaborator Author

happz commented Apr 29, 2024

A similar case: what if the test does not crash, but triggers a reboot, e.g. through an Ansible role that cannot use tmt-reboot? This would manifest as a broken SSH session:

                out: TASK [sap_general_preconfigure : Flush handlers] *******************************
                out: 
                out: RUNNING HANDLER [sap_general_preconfigure : Reboot the managed node] ***********
                out: Shared connection to restqe01 closed.
            cmd: rsync --version
            err: ssh: connect to host restqe01 port 22: Connection refused
            cmd: dnf --version
            err: ssh: connect to host restqe01 port 22: Connection refused
            cmd: rpm-ostree --version
            err: ssh: connect to host restqe01 port 22: Connection refused
            cmd: yum install -y rsync
            err: ssh: connect to host restqe01 port 22: Connection refused

happz added a commit that referenced this issue May 2, 2024
happz added a commit that referenced this issue May 2, 2024
happz added a commit that referenced this issue May 3, 2024
psss pushed a commit that referenced this issue May 3, 2024
@pablmart

pablmart commented May 3, 2024

The MR 2870 solves the issue with the kernel-stack-overflow-underflow-scribbling test. Many thanks!
