Skip to content

ghenvtest#2

Closed
rppt wants to merge 1 commit into
masterfrom
ghenv
Closed

ghenvtest#2
rppt wants to merge 1 commit into
masterfrom
ghenv

Conversation

@rppt
Copy link
Copy Markdown
Owner

@rppt rppt commented May 9, 2026

No description provided.

Signed-off-by: Mike Rapoport (Microsoft) <rppt@kernel.org>
@rppt rppt closed this May 9, 2026
rppt pushed a commit that referenced this pull request Jun 2, 2026
…kernel/git/kvmarm/kvmarm into HEAD

KVM/arm64 fixes for 7.1, take #2

- Add the pKVM side of the workaround for ARM's erratum 4193714, provided
  that the EL3 firmware does its part of the job. KVM will refuse to
  initialise otherwise.

- Correctly handle 52bit VAs for guest EL2 stage-1 translations when
  running under NV with E2H==0.

- Correctly deal with permission faults in guest_memfd memslots.

- Fix the steal-time selftest after the infrastructure was reworked.

- Make sure the host cannot pass a non-sensical clock update to the
  EL2 tracing infrastructure.

- Appoint Steffen Eiden as a reviewer in anticipation of the KVM/s390
  ability to run arm64 guests, which will inevitably lead to arm64
  code being directly used on s390.

- Make sure that EL2 is configured with both exception entry and exit
  being Context Synchronization Events.

- Handle the current vcpu being NULL on EL2 panic.

- Fix the selftest_vcpu memcache being empty at the point of donation or
  sharing.

- Check that the memcache has enough capacity before engaging on the
  share/donate path.

- Fix __deactivate_fgt() to use its parameter rather than a variable
  in the macro context.
rppt pushed a commit that referenced this pull request Jun 2, 2026
'pi2_timer' needs to be initialized in all error paths of dualpi2_init():
otherwise, a failure in qdisc_create_dflt() causes the following crash in
dualpi2_destroy():

  # tc qdisc add dev crash0 handle 1: root dualpi2
  BUG: kernel NULL pointer dereference, address: 0000000000000010
  #PF: supervisor read access in kernel mode
  #PF: error_code(0x0000) - not-present page
  PGD 0 P4D 0
  Oops: Oops: 0000 [#1] SMP PTI
  CPU: 4 UID: 0 PID: 471 Comm: tc Tainted: G            E       7.1.0-rc1-virtme #2 PREEMPT(full)
  Tainted: [E]=UNSIGNED_MODULE
  Hardware name: Bochs Bochs, BIOS Bochs 01/01/2011
  RIP: 0010:hrtimer_active+0x39/0x60
  Code: f9 eb 23 0f b6 41 38 3c 01 0f 87 87 64 c0 ff 83 e0 01 75 33 48 39 4a 50 74 28 44 3b 42 10 75 06 48 3b 51 30 74 21 48 8b 51 30 <44> 8b 42 10 41 f6 c0 01 74 cf f3 90 44 8b 42 10 41 f6 c0 01 74 c3
  RSP: 0018:ffffd0db80b93620 EFLAGS: 00010282
  RAX: ffffffffc0400320 RBX: ffff8cf24a4c86b8 RCX: ffff8cf24a4c86b8
  RDX: 0000000000000000 RSI: ffff8cf2429c2ab0 RDI: ffff8cf24a4c86b8
  RBP: 00000000fffffff4 R08: 0000000000000003 R09: 0000000000000000
  R10: 0000000000000001 R11: ffff8cf24a39c500 R12: ffff8cf24822c000
  R13: ffffd0db80b936c0 R14: ffffffffc02cf360 R15: 00000000ffffffff
  FS:  00007fbc01706580(0000) GS:ffff8cf2dc759000(0000) knlGS:0000000000000000
  CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
  CR2: 0000000000000010 CR3: 0000000008e02003 CR4: 0000000000172ef0
  Call Trace:
   <TASK>
   hrtimer_cancel+0x15/0x40
   dualpi2_destroy+0x20/0x40 [sch_dualpi2]
   qdisc_create+0x230/0x570
   tc_modify_qdisc+0x716/0xc10
   rtnetlink_rcv_msg+0x188/0x780
   netlink_rcv_skb+0xcd/0x150
   netlink_unicast+0x1ba/0x290
   netlink_sendmsg+0x242/0x4d0
   ____sys_sendmsg+0x39e/0x3e0
   ___sys_sendmsg+0xe1/0x130
   __sys_sendmsg+0xad/0x110
   do_syscall_64+0x14f/0xf80
   entry_SYSCALL_64_after_hwframe+0x77/0x7f
  RIP: 0033:0x7fbc0188b08e
  Code: 4d 89 d8 e8 94 bd 00 00 4c 8b 5d f8 41 8b 93 08 03 00 00 59 5e 48 83 f8 fc 74 11 c9 c3 0f 1f 80 00 00 00 00 48 8b 45 10 0f 05 <c9> c3 83 e2 39 83 fa 08 75 e7 e8 03 ff ff ff 0f 1f 00 f3 0f 1e fa
  RSP: 002b:00007fff593260e0 EFLAGS: 00000202 ORIG_RAX: 000000000000002e
  RAX: ffffffffffffffda RBX: 0000000000000000 RCX: 00007fbc0188b08e
  RDX: 0000000000000000 RSI: 00007fff59326190 RDI: 0000000000000003
  RBP: 00007fff593260f0 R08: 0000000000000000 R09: 0000000000000000
  R10: 0000000000000000 R11: 0000000000000202 R12: 000055f06124f260
  R13: 0000000069fca043 R14: 000055f061255640 R15: 000055f06124d3f8
   </TASK>
  Modules linked in: sch_dualpi2(E)
  CR2: 0000000000000010

[1] https://lore.kernel.org/netdev/2e78e01c504c633ebdff18d041833cf2e079a3a4.1607020450.git.dcaratti@redhat.com/
[2] https://lore.kernel.org/netdev/20200725201707.16909-1-xiyou.wangcong@gmail.com/

v2:
 - rebased on top of latest net.git

Fixes: 320d031 ("sched: Struct definition and parsing of dualpi2 qdisc")
Signed-off-by: Davide Caratti <dcaratti@redhat.com>
Reviewed-by: Eric Dumazet <edumazet@google.com>
Link: https://patch.msgid.link/1faca91179702b31da5d87653e1e036543e32722.1778259798.git.dcaratti@redhat.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
rppt pushed a commit that referenced this pull request Jun 2, 2026
The reset actions of the DEFZA adapters are exceedingly slow, taking up
to 30 seconds to complete by the device spec and typically in the range
of 10 seconds in reality, as required for the device RTOS to boot, still
quite a lot.  Therefore a state machine is used that's interrupt driven,
however a safety mechanism is required in case of adapter malfunction,
so that if no state change interrupt has arrived in time, then the
situation is taken care of.

The safety mechanism depends on the origin of the reset.  For regular
adapter initialisation at the device probe time a sleep is requested.
However a reset is also required by the device spec when the adapter has
transitioned into the halted state, such as in response to a PC Trace
event in the course of ring fault recovery, possibly a common network
event.  In that case no sleep is possible as a device halt is reported
at the hardirq level.

A timer is therefore set up to ensure progress in case no adapter state
change interrupt has arrived in time, but as from commit 168f6b6
("timers: Use del_timer_sync() even on UP") a warning is issued as the
timer is deleted in the hardirq handler upon an expected state change:

  defza: v.1.1.4  Oct  6 2018  Maciej W. Rozycki
  tc2: DEC FDDIcontroller 700 or 700-C at 0x18000000, irq 4
  tc2: resetting the board...
  ------------[ cut here ]------------
  WARNING: kernel/time/timer.c:1611 at __timer_delete_sync+0x104/0x120, CPU#0: swapper/0/0
  Modules linked in:
  CPU: 0 UID: 0 PID: 0 Comm: swapper/0 Not tainted 7.0.0-dirty #2 VOLUNTARY
  Stack : 9800000002027d08 00000000140120e0 0000000000000000 ffffffff8089d468
          0000000000000000 0000000000000000 ffffffff807ed6b8 ffffffff80897458
          ffffffff80897400 9800000002027b88 0000000000000000 7070617773203a6d
          0000000000000000 9800000002027ba4 0000000000001000 6465746e69617420
          0000000000000000 ffffffff807ed6b8 00000000140120e0 0000000000000009
          000000000000064b ffffffff800dd14c 0000000000000036 9800000002184000
          0000000000000000 0000000000000020 0000000000000000 ffffffff80910000
          ffffffff8085c000 9800000002027c70 0000000000000001 ffffffff80045fa0
          0000000000000000 0000000000000000 0000000000000000 0000000000000009
          000000000000064b ffffffff800502b8 ffffffff807ed6b8 ffffffff80045fa0
          ...
  Call Trace:
  [<ffffffff800502b8>] show_stack+0x28/0xf0
  [<ffffffff80045fa0>] dump_stack_lvl+0x48/0x7c
  [<ffffffff80068c98>] __warn+0xa0/0x128
  [<ffffffff8004120c>] warn_slowpath_fmt+0x64/0xa4
  [<ffffffff800dd14c>] __timer_delete_sync+0x104/0x120
  [<ffffffff804934ac>] fza_interrupt+0xc74/0xeb8
  [<ffffffff800c6390>] __handle_irq_event_percpu+0x70/0x228
  [<ffffffff800c6560>] handle_irq_event_percpu+0x18/0x78
  [<ffffffff800cc320>] handle_percpu_irq+0x50/0x80
  [<ffffffff800c5970>] generic_handle_irq+0x90/0xd0
  [<ffffffff806e956c>] do_IRQ+0x1c/0x30
  [<ffffffff8004ad4c>] handle_int+0x148/0x154
  [<ffffffff800ab7c0>] do_idle+0x40/0x108
  [<ffffffff800abb0c>] cpu_startup_entry+0x2c/0x38
  [<ffffffff806dfec8>] kernel_init+0x0/0x108

  ---[ end trace 0000000000000000 ]---
  tc2: OK
  tc2: model 700 (DEFZA-AA), MMF PMD, address 08-00-2b-xx-xx-xx
  tc2: ROM rev. 1.0, firmware rev. 1.2, RMC rev. A, SMT ver. 1
  tc2: link unavailable
  ------------[ cut here ]------------
  WARNING: kernel/time/timer.c:1611 at __timer_delete_sync+0x104/0x120, CPU#0: swapper/0/0
  Modules linked in:
  CPU: 0 UID: 0 PID: 0 Comm: swapper/0 Tainted: G        W           7.0.0-dirty #2 VOLUNTARY
  Tainted: [W]=WARN
  Stack : 9800000002027d08 00000000140120e0 0000000000000000 ffffffff8089d468
          0000000000000000 0000000000000000 ffffffff807ed6b8 ffffffff80897458
          ffffffff80897400 9800000002027b88 0000000000000000 0000000000000000
          0000000000000000 9800000002027ba4 0000000000001000 0000000000000000
          0000000000000000 ffffffff807ed6b8 00000000140120e0 0000000000000009
          000000000000064b ffffffff800dd14c 0000000000000036 9800000002184000
          0000000000000000 0000000000000020 0000000000000000 ffffffff80910000
          ffffffff8085c000 9800000002027c70 0000000000000001 ffffffff80045fa0
          0000000000000000 0000000000000000 0000000000000000 0000000000000009
          000000000000064b ffffffff800502b8 ffffffff807ed6b8 ffffffff80045fa0
          ...
  Call Trace:
  [<ffffffff800502b8>] show_stack+0x28/0xf0
  [<ffffffff80045fa0>] dump_stack_lvl+0x48/0x7c
  [<ffffffff80068c98>] __warn+0xa0/0x128
  [<ffffffff8004120c>] warn_slowpath_fmt+0x64/0xa4
  [<ffffffff800dd14c>] __timer_delete_sync+0x104/0x120
  [<ffffffff804934ac>] fza_interrupt+0xc74/0xeb8
  [<ffffffff800c6390>] __handle_irq_event_percpu+0x70/0x228
  [<ffffffff800c6560>] handle_irq_event_percpu+0x18/0x78
  [<ffffffff800cc320>] handle_percpu_irq+0x50/0x80
  [<ffffffff800c5970>] generic_handle_irq+0x90/0xd0
  [<ffffffff806e956c>] do_IRQ+0x1c/0x30
  [<ffffffff8004ad4c>] handle_int+0x148/0x154
  [<ffffffff806de8a4>] arch_local_irq_disable+0x4/0x28
  [<ffffffff800ab7d0>] do_idle+0x50/0x108
  [<ffffffff800abb0c>] cpu_startup_entry+0x2c/0x38
  [<ffffffff806dfec8>] kernel_init+0x0/0x108

  ---[ end trace 0000000000000000 ]---
  tc2: registered as fddi0

The immediate origin of the new warning is the switch away from aliasing
del_timer_sync() to del_timer() (timer_delete_sync() to timer_delete()
in terms of current function names) for UP configurations, which however
is the only choice for this driver anyway as no SMP hardware supports
the TURBOchannel bus this device interfaces to.  Therefore there is a
very remote issue only this is a sign of.

Specifically if an adapter reset issued upon a transition to the halted
state times out and first triggers fza_reset_timer() for another reset
assertion, which then schedules fza_reset_timer() for reset deassertion
and then that second call is pre-empted after poking at the hardware,
but before the timer has been rearmed and owing to high system load
causing exceedingly high scheduling latency control is not handed back
before a transition to the uninitialised state has caused the timer to
be deleted even before it has been started, then fza_reset_timer() will
be called yet again and issue another reset even though by then the
adapter has already recovered.

Prevent this situation from happening by switching to timer_delete() for
the transition to the halted state and protect the code region affected
with a spinlock, also to make sure add_timer() has not been called twice
in a row due to an execution race between the interrupt handler and the
timer handler (though it could only happen on SMP, but let's keep the
driver clean).  It's a very unlikely sequence of events to happen and
therefore there's no point in trying to be overly clever about it, such
as by placing printk() calls outside the protection.  For the transition
to the uninitialised state switch to timer_delete_sync_try() instead, so
that a timer isn't deleted that's just been rearmed by the timer handler
and needs to watch for the device to come out of reset again (again, an
SMP scenario only).

Retain timer_delete_sync() invocations outside the hardirq context for a
stray timer not to fire once device structures have been released.

Fixes: 61414f5 ("FDDI: defza: Add support for DEC FDDIcontroller 700 TURBOchannel adapter")
Signed-off-by: Maciej W. Rozycki <macro@orcam.me.uk>
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
rppt pushed a commit that referenced this pull request Jun 2, 2026
Mathias Stearn reports that since v6.19, there are two big issues
affecting rseq:

(1) On arm64 specifically, rseq critical sections aren't aborted when
    they should be.

(2) The 'cpu_id_start' field is no longer written by the kernel in all
    cases it used to be, including some cases where TCMalloc depends on
    the kernel clobbering the field.

This patch fixes issue #1. This patch DOES NOT fix issue #2, which will
need to be addressed by other patches.

The arm64-specific brokenness is a result of commits:

  2fc0e4b ("rseq: Record interrupt from user space")
  39a1675 ("rseq: Optimize event setting")

The first commit failed to add a call to rseq_note_user_irq_entry() on
arm64. Thus arm64 never sets rseq_event::user_irq to record that it may
be necessary to abort an active rseq critical section upon return to
userspace. On its own, this commit had no functional impact as the value
of rseq_event::user_irq was not consumed.

The second commit relied upon rseq_event::user_irq to determine whether
or not to bother to perform rseq work when returning to userspace. As
rseq_event::user_irq wasn't set on arm64, this work would be skipped,
and consequently an active rseq critical section would not be aborted.

Fix this by giving arm64 syscall-specific entry/exit paths, and
performing the relevant logic in syscall and non-syscall paths,
including calling rseq_note_user_irq_entry() for non-syscall entry.

Currently arm64 cannot use syscall_enter_from_user_mode(),
syscall_exit_to_user_mode(), and irqentry_exit_to_user_mode(), due to
ordering constraints with exception masking, and risk of ABI breakage
for syscall tracing/audit/etc. For the moment the entry/exit logic is
left as arm64-specific, directly using enter_from_user_mode() and
exit_to_user_mode(), but mirroring the generic code.

I intend to follow up with refactoring/cleanup, as we did for kernel
mode entry paths in commit:

  041aa7a ("entry: Split preemption from irqentry_exit_to_kernel_mode()")

... which will allow arm64 to use the GENERIC_IRQ_ENTRY functions directly.

Fixes: 39a1675 ("rseq: Optimize event setting")
Reported-by: Mathias Stearn <mathias@mongodb.com>
Signed-off-by: Mark Rutland <mark.rutland@arm.com>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Acked-by: Catalin Marinas <catalin.marinas@arm.com>
Link: https://lore.kernel.org/regressions/CAHnCjA25b+nO2n5CeifknSKHssJpPrjnf+dtr7UgzRw4Zgu=oA@mail.gmail.com/
Link: https://patch.msgid.link/20260508142023.3268622-1-mark.rutland@arm.com
rppt pushed a commit that referenced this pull request Jun 2, 2026
…g-a-bridge-port'

Ido Schimmel says:

====================
bridge: mcast: Fix a possible use-after-free when removing a bridge port

Patch #1 fixes a possible use-after-free when removing a bridge port.

Patch #2 adds a test case that triggers the problem.

In net-next we can:

1. Add DEBUG_NET_WARN_ON_ONCE() when a port multicast context is
de-initialized while enabled.

2. When de-initializing a port multicast context, synchronously shutdown
all the timers that were initialized when the context was initialized.
====================

Link: https://patch.msgid.link/20260517121122.188333-1-idosch@nvidia.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
rppt pushed a commit that referenced this pull request Jun 2, 2026
Sashiko reported that the PF driver accepts arbitrary MAC address from
from VF mailbox messages without proper validation, creating a security
vulnerability [1].

In enetc_msg_pf_set_vf_primary_mac_addr(), the MAC address is extracted
directly from the message buffer (cmd->mac.sa_data) and programmed into
hardware via pf->ops->set_si_primary_mac() without any validity checks.
A malicious VF can configure a multicast, broadcast, or all-zero MAC
address. Therefore, a validation to check the MAC address provided by VF
is required.

However, simply checking the MAC address is not enough, because it also
has the potential TOCTOU race [2]: The code reads the MAC address from
the DMA buffer to validate it via is_valid_ether_addr(), if validation
passes, reads the same DMA buffer a second time when calling
enetc_pf_set_primary_mac_addr() to program the hardware. A malicious VF
can exploit this window by overwriting the MAC address in the DMA buffer
between the validation check and the hardware programming, bypassing the
validation entirely.

Therefore, allocate a local buffer in enetc_msg_handle_rxmsg() and copy
the message content from the DMA buffer via memcpy() before processing.
This ensures the PF operates on a stable snapshot that the VF cannot
modify.

Link: https://sashiko.dev/#/patchset/20260511080805.2052495-1-wei.fang%40nxp.com #1
Link: https://sashiko.dev/#/patchset/20260513103021.2190593-1-wei.fang%40nxp.com #2
Fixes: beb74ac ("enetc: Add vf to pf messaging support")
Signed-off-by: Wei Fang <wei.fang@nxp.com>
Reviewed-by: Harshitha Ramamurthy <hramamurthy@google.com>
Link: https://patch.msgid.link/20260520064421.91569-5-wei.fang@nxp.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
rppt pushed a commit that referenced this pull request Jun 2, 2026
With PROVE_LOCKING on an Snapdragon X1 and VM reclaim pressure, we see:

   ======================================================
   WARNING: possible circular locking dependency detected
   7.0.0-debug+ #43 Tainted: G        W
   ------------------------------------------------------
   kswapd0/82 is trying to acquire lock:
   ffff800080ec3870 (reservation_ww_class_acquire){+.+.}-{0:0}, at: msm_gem_shrinker_scan+0x17c/0x400 [msm]

   but task is already holding lock:
   ffffc31709b263b8 (fs_reclaim){+.+.}-{0:0}, at: balance_pgdat+0x88/0x988

   which lock already depends on the new lock.

   the existing dependency chain (in reverse order) is:

   -> #2 (fs_reclaim){+.+.}-{0:0}:
          __lock_acquire+0x4d0/0xad0
          lock_acquire.part.0+0xc4/0x248
          lock_acquire+0x8c/0x248
          fs_reclaim_acquire+0xd0/0xf0
          dma_resv_lockdep+0x224/0x348
          do_one_initcall+0x84/0x5d0
          do_initcalls+0x194/0x1d8
          kernel_init_freeable+0x128/0x180
          kernel_init+0x2c/0x160
          ret_from_fork+0x10/0x20

   -> #1 (reservation_ww_class_mutex){+.+.}-{4:4}:
          __lock_acquire+0x4d0/0xad0
          lock_acquire.part.0+0xc4/0x248
          lock_acquire+0x8c/0x248
          dma_resv_lockdep+0x1a8/0x348
          do_one_initcall+0x84/0x5d0
          do_initcalls+0x194/0x1d8
          kernel_init_freeable+0x128/0x180
          kernel_init+0x2c/0x160
          ret_from_fork+0x10/0x20

   -> #0 (reservation_ww_class_acquire){+.+.}-{0:0}:
          check_prev_add+0x114/0x790
          validate_chain+0x594/0x6f0
          __lock_acquire+0x4d0/0xad0
          lock_acquire.part.0+0xc4/0x248
          lock_acquire+0x8c/0x248
          drm_gem_lru_scan+0x1ac/0x440
          msm_gem_shrinker_scan+0x17c/0x400 [msm]
          do_shrink_slab+0x150/0x4a0
          shrink_slab+0x144/0x460
          shrink_one+0x9c/0x1b0
          shrink_many+0x27c/0x5c0
          shrink_node+0x344/0x550
          balance_pgdat+0x2c0/0x988
          kswapd+0x11c/0x318
          kthread+0x10c/0x128
          ret_from_fork+0x10/0x20

   other info that might help us debug this:
   Chain exists of:
     reservation_ww_class_acquire --> reservation_ww_class_mutex --> fs_reclaim
    Possible unsafe locking scenario:
          CPU0                    CPU1
          ----                    ----
     lock(fs_reclaim);
                                  lock(reservation_ww_class_mutex);
                                  lock(fs_reclaim);
     lock(reservation_ww_class_acquire);

    *** DEADLOCK ***
   1 lock held by kswapd0/82:
    #0: ffffc31709b263b8 (fs_reclaim){+.+.}-{0:0}, at: balance_pgdat+0x88/0x988

   stack backtrace:
   CPU: 4 UID: 0 PID: 82 Comm: kswapd0 Tainted: G        W           7.0.0-debug+ #43 PREEMPT(full)
   Tainted: [W]=WARN
   Hardware name: LENOVO 21BX0016US/21BX0016US, BIOS N3HET94W (1.66 ) 09/15/2025
   Call trace:
    show_stack+0x20/0x40 (C)
    dump_stack_lvl+0x9c/0xd0
    dump_stack+0x18/0x30
    print_circular_bug+0x114/0x120
    check_noncircular+0x178/0x198
    check_prev_add+0x114/0x790
    validate_chain+0x594/0x6f0
    __lock_acquire+0x4d0/0xad0
    lock_acquire.part.0+0xc4/0x248
    lock_acquire+0x8c/0x248
    drm_gem_lru_scan+0x1ac/0x440
    msm_gem_shrinker_scan+0x17c/0x400 [msm]
    do_shrink_slab+0x150/0x4a0
    shrink_slab+0x144/0x460
    shrink_one+0x9c/0x1b0
    shrink_many+0x27c/0x5c0
    shrink_node+0x344/0x550
    balance_pgdat+0x2c0/0x988
    kswapd+0x11c/0x318
    kthread+0x10c/0x128
    ret_from_fork+0x10/0x20

kswapd0 holding fs_reclaim calls the MSM shrinker, which calls
dma_resv_lock. This in turn acquires fs_reclaim.

Fix this deadlock by using dma_resv_trylock() instead, dropping the
subsequently unused passed wait-wound lock 'ticket'.

Cc: stable@vger.kernel.org
Signed-off-by: Daniel J Blueman <daniel@quora.org>
Fixes: fe4952b ("drm/msm: Convert vm locking")
Patchwork: https://patchwork.freedesktop.org/patch/723564/
Message-ID: <20260508065722.18785-1-daniel@quora.org>
[rob: fixup compile errors, replace lockdep splat with something legible]
Signed-off-by: Rob Clark <robin.clark@oss.qualcomm.com>
rppt pushed a commit that referenced this pull request Jun 2, 2026
…softlockup

We hit a real softlockup in an internal stress test environment.  The
workload was LTP memory/swap stress on a large arm64 machine, with 320
CPUs, about 1TB memory and an 8.6GB swap device.  The system was under
heavy load and the swap device had a large number of full clusters.  The
softlockup was triggered during a stress test after about 3 days.

So, add periodic cond_resched() calls during large full_clusters
reclaim operations to prevent softlockup issues.

Detailed call trace as follow:

PID: 3817773  TASK: ffff0883bb28b780  CPU: 48   COMMAND: "kworker/48:7"
   #0 [ffff800080183d10] __crash_kexec at ffffa4c1361e5de4
   #1 [ffff800080183d90] panic at ffffa4c1360d5e9c
   #2 [ffff800080183e20] watchdog_timer_fn at ffffa4c136231fa8
   ...
  #16 [ffff8000c4ad3cb0] swap_cache_del_folio at ffffa4c1363e1614
  #17 [ffff8000c4ad3ce0] __try_to_reclaim_swap at ffffa4c1363e4bfc
  #18 [ffff8000c4ad3d40] swap_reclaim_full_clusters at ffffa4c1363e5474
  #19 [ffff8000c4ad3da0] swap_reclaim_work at ffffa4c1363e550c
  #20 [ffff8000c4ad3dc0] process_one_work at ffffa4c136102edc
  #21 [ffff8000c4ad3e10] worker_thread at ffffa4c136103398
  #22 [ffff8000c4ad3e70] kthread at ffffa4c13610d95c

Link: https://lore.kernel.org/20260506130919.2298807-1-kerayhuang@tencent.com
Fixes: 5168a68 ("mm, swap: avoid over reclaim of full clusters")
Signed-off-by: Zijiang Huang <kerayhuang@tencent.com>
Reviewed-by: Kairui Song <kasong@tencent.com>
Reviewed-by: Hao Peng <flyingpeng@tencent.com>
Reviewed-by: albinwyang <albinwyang@tencent.com>
Reviewed-by: Baoquan He <baoquan.he@linux.dev>
Acked-by: Chris Li <chrisl@kernel.org>
Cc: Barry Song <baohua@kernel.org>
Cc: Kairui Song <kasong@tencent.com>
Cc: Kemeng Shi <shikemeng@huaweicloud.com>
Cc: Nhat Pham <nphamcs@gmail.com>
Cc: Youngjun Park <youngjun.park@lge.com>
Cc: <stable@vger.kernel.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

1 participant