Skip to content

Commit

Permalink
smp,csd: Throw an error if a CSD lock is stuck for too long
Browse files Browse the repository at this point in the history
[ Upstream commit 94b3f0b ]

The CSD lock seems to get stuck in 2 "modes". When it gets stuck
temporarily, it usually gets released in a few seconds, and sometimes
up to one or two minutes.

If the CSD lock stays stuck for more than several minutes, it never
seems to get unstuck, and gradually more and more things in the system
end up also getting stuck.

In the latter case, we should just give up, so the system can dump out
a little more information about what went wrong, and, with panic_on_oops
and a kdump kernel loaded, dump a whole bunch more information about what
might have gone wrong.  In addition, there is an smp.panic_on_ipistall
kernel boot parameter that by default retains the old behavior, but when
set enables the panic after the CSD lock has been stuck for more than
the specified number of milliseconds, as in 300,000 for five minutes.

[ paulmck: Apply Imran Khan feedback. ]
[ paulmck: Apply Leonardo Bras feedback. ]

Link: https://lore.kernel.org/lkml/bc7cc8b0-f587-4451-8bcd-0daae627bcc7@paulmck-laptop/
Signed-off-by: Rik van Riel <riel@surriel.com>
Signed-off-by: Paul E. McKenney <paulmck@kernel.org>
Reviewed-by: Imran Khan <imran.f.khan@oracle.com>
Reviewed-by: Leonardo Bras <leobras@redhat.com>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Valentin Schneider <vschneid@redhat.com>
Cc: Juergen Gross <jgross@suse.com>
Cc: Jonathan Corbet <corbet@lwn.net>
Cc: Randy Dunlap <rdunlap@infradead.org>
Signed-off-by: Sasha Levin <sashal@kernel.org>
  • Loading branch information
rikvanriel authored and gregkh committed Nov 28, 2023
1 parent d4d2297 commit 2ca0494
Show file tree
Hide file tree
Showing 2 changed files with 19 additions and 1 deletion.
7 changes: 7 additions & 0 deletions Documentation/admin-guide/kernel-parameters.txt
Original file line number Diff line number Diff line change
Expand Up @@ -5781,6 +5781,13 @@
This feature may be more efficiently disabled
using the csdlock_debug- kernel parameter.

smp.panic_on_ipistall= [KNL]
If a csd_lock_timeout extends for more than
the specified number of milliseconds, panic the
system. By default, let CSD-lock acquisition
take as long as they take. Specifying 300,000
for this value provides a 5-minute timeout.

smsc-ircc2.nopnp [HW] Don't use PNP to discover SMC devices
smsc-ircc2.ircc_cfg= [HW] Device configuration I/O port
smsc-ircc2.ircc_sir= [HW] SIR base I/O port
Expand Down
13 changes: 12 additions & 1 deletion kernel/smp.c
Original file line number Diff line number Diff line change
Expand Up @@ -168,6 +168,8 @@ static DEFINE_PER_CPU(void *, cur_csd_info);

static ulong csd_lock_timeout = 5000; /* CSD lock timeout in milliseconds. */
module_param(csd_lock_timeout, ulong, 0444);
static int panic_on_ipistall; /* CSD panic timeout in milliseconds, 300000 for five minutes. */
module_param(panic_on_ipistall, int, 0444);

static atomic_t csd_bug_count = ATOMIC_INIT(0);

Expand Down Expand Up @@ -228,6 +230,7 @@ static bool csd_lock_wait_toolong(struct __call_single_data *csd, u64 ts0, u64 *
}

ts2 = sched_clock();
/* How long since we last checked for a stuck CSD lock.*/
ts_delta = ts2 - *ts1;
if (likely(ts_delta <= csd_lock_timeout_ns || csd_lock_timeout_ns == 0))
return false;
Expand All @@ -241,9 +244,17 @@ static bool csd_lock_wait_toolong(struct __call_single_data *csd, u64 ts0, u64 *
else
cpux = cpu;
cpu_cur_csd = smp_load_acquire(&per_cpu(cur_csd, cpux)); /* Before func and info. */
/* How long since this CSD lock was stuck. */
ts_delta = ts2 - ts0;
pr_alert("csd: %s non-responsive CSD lock (#%d) on CPU#%d, waiting %llu ns for CPU#%02d %pS(%ps).\n",
firsttime ? "Detected" : "Continued", *bug_id, raw_smp_processor_id(), ts2 - ts0,
firsttime ? "Detected" : "Continued", *bug_id, raw_smp_processor_id(), ts_delta,
cpu, csd->func, csd->info);
/*
* If the CSD lock is still stuck after 5 minutes, it is unlikely
* to become unstuck. Use a signed comparison to avoid triggering
* on underflows when the TSC is out of sync between sockets.
*/
BUG_ON(panic_on_ipistall > 0 && (s64)ts_delta > ((s64)panic_on_ipistall * NSEC_PER_MSEC));
if (cpu_cur_csd && csd != cpu_cur_csd) {
pr_alert("\tcsd: CSD lock (#%d) handling prior %pS(%ps) request.\n",
*bug_id, READ_ONCE(per_cpu(cur_csd_func, cpux)),
Expand Down

0 comments on commit 2ca0494

Please sign in to comment.