x86/mce: Defer processing of early errors
[ Upstream commit 3bff147 ]

When a fatal machine check results in a system reset, Linux does not
clear the error(s) from machine check bank(s) - hardware preserves the
machine check banks across a warm reset.

During initialization of the kernel after the reboot, Linux reads, logs,
and clears all machine check banks.

But there is a problem. In:

  5de97c9 ("x86/mce: Factor out and deprecate the /dev/mcelog driver")

the call to mce_register_decode_chain() moved later in the boot
sequence. This means that /dev/mcelog doesn't see those early error
logs.

This was partially fixed by:

  cd9c57c ("x86/MCE: Dump MCE to dmesg if no consumers")

which made sure that the logs were not lost completely by printing
to the console. But parsing console logs is error prone. Users of
/dev/mcelog should expect to find any early errors logged to standard
places.

Add a new flag MCP_QUEUE_LOG to machine_check_poll() to be used in early
machine check initialization to indicate that any errors found should
just be queued to genpool. When mcheck_late_init() is called it will
call mce_schedule_work() to actually log and flush any errors queued in
the genpool.
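The queue-then-flush ordering described above can be sketched in plain userspace C. This is only an illustration, not kernel code: every name in it (poll_one, pool_add, flush_pool, late_init, the POLL_* flags, struct err_record) is hypothetical and merely stands in for machine_check_poll(), mce_gen_pool_add(), the work scheduled by mce_schedule_work(), mcheck_late_init() and enum mcp_flags. It assumes the normal logging path amounts to "queue, then kick processing", and it ignores locking, interrupt context and bank clearing.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical flags loosely mirroring enum mcp_flags. */
enum poll_flags {
	POLL_UC        = 1u << 1,	/* log uncorrected errors */
	POLL_DONTLOG   = 1u << 2,	/* only clear, don't log */
	POLL_QUEUE_LOG = 1u << 3,	/* only queue, don't process yet */
};

/* Hypothetical stand-in for a logged error record. */
struct err_record {
	int bank;
	unsigned long long status;
};

#define POOL_CAP 16
static struct err_record pool[POOL_CAP];	/* stand-in for the genpool */
static size_t pool_len;

typedef void (*consumer_fn)(const struct err_record *);
static consumer_fn consumer;	/* NULL until late init registers one */
static size_t delivered;	/* records handed to a consumer */
static size_t dropped;		/* records processed with no consumer */

/* Example consumer that only counts what it receives. */
static void count_consumer(const struct err_record *rec)
{
	(void)rec;
	delivered++;
}

/* Queue a record; returns -1 if the pool is full. */
static int pool_add(const struct err_record *rec)
{
	assert(rec != NULL);
	if (pool_len == POOL_CAP)
		return -1;
	pool[pool_len++] = *rec;
	return 0;
}

/* Hand every queued record to the consumer, then empty the pool. */
static void flush_pool(void)
{
	for (size_t i = 0; i < pool_len; i++) {
		if (consumer)
			consumer(&pool[i]);
		else
			dropped++;
	}
	pool_len = 0;
}

/* With POLL_QUEUE_LOG the record is queued and processing is deferred. */
static void poll_one(unsigned int flags, const struct err_record *rec)
{
	if (flags & POLL_DONTLOG)
		return;
	if (pool_add(rec) < 0)
		return;
	if (!(flags & POLL_QUEUE_LOG))
		flush_pool();
}

/* Register the consumer, then process whatever early boot queued. */
static void late_init(consumer_fn fn)
{
	consumer = fn;
	flush_pool();
}
```

In this sketch, calling poll_one() with POLL_UC | POLL_QUEUE_LOG before late_init() leaves the record sitting in the pool instead of processing it while no consumer exists. A later late_init(count_consumer) delivers it, which is the ordering the new flag is meant to guarantee.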

 [ Based on an original patch, commit message by and completely
   productized by Tony Luck. ]

Fixes: 5de97c9 ("x86/mce: Factor out and deprecate the /dev/mcelog driver")
Reported-by: Sumanth Kamatala <skamatala@juniper.net>
Signed-off-by: Borislav Petkov <bp@suse.de>
Signed-off-by: Tony Luck <tony.luck@intel.com>
Signed-off-by: Borislav Petkov <bp@suse.de>
Link: https://lkml.kernel.org/r/20210824003129.GA1642753@agluck-desk2.amr.corp.intel.com
Signed-off-by: Sasha Levin <sashal@kernel.org>
bp3tk0v authored and gregkh committed Sep 15, 2021
1 parent 7d69d7e commit 287c4b8
Showing 2 changed files with 9 additions and 3 deletions.
1 change: 1 addition & 0 deletions arch/x86/include/asm/mce.h
@@ -265,6 +265,7 @@ enum mcp_flags {
 	MCP_TIMESTAMP	= BIT(0),	/* log time stamp */
 	MCP_UC		= BIT(1),	/* log uncorrected errors */
 	MCP_DONTLOG	= BIT(2),	/* only clear, don't log */
+	MCP_QUEUE_LOG	= BIT(3),	/* only queue to genpool */
 };
 bool machine_check_poll(enum mcp_flags flags, mce_banks_t *b);

11 changes: 8 additions & 3 deletions arch/x86/kernel/cpu/mce/core.c
@@ -817,7 +817,10 @@ bool machine_check_poll(enum mcp_flags flags, mce_banks_t *b)
 		if (mca_cfg.dont_log_ce && !mce_usable_address(&m))
 			goto clear_it;
 
-		mce_log(&m);
+		if (flags & MCP_QUEUE_LOG)
+			mce_gen_pool_add(&m);
+		else
+			mce_log(&m);
 
 clear_it:
 		/*
@@ -1630,10 +1633,12 @@ static void __mcheck_cpu_init_generic(void)
 		m_fl = MCP_DONTLOG;
 
 	/*
-	 * Log the machine checks left over from the previous reset.
+	 * Log the machine checks left over from the previous reset. Log them
+	 * only, do not start processing them. That will happen in mcheck_late_init()
+	 * when all consumers have been registered on the notifier chain.
 	 */
 	bitmap_fill(all_banks, MAX_NR_BANKS);
-	machine_check_poll(MCP_UC | m_fl, &all_banks);
+	machine_check_poll(MCP_UC | MCP_QUEUE_LOG | m_fl, &all_banks);
 
 	cr4_set_bits(X86_CR4_MCE);
