Second attempt to fix the CPU startup failure #4591

pelwell · 2021-09-20T11:13:44Z

ARM: proc-v7: Retry uncached stmia if necessary

A failure of some CPU cores to come online has been traced to the
failure of a stm instruction while the cache is disabled. The symptom
is that the saved values read back as zeroes, a catastrophic error since
one of the values is a return address.

This patch forces a readback and retry until the correct value is
returned,

Notes:

At this stage in the boot process the core is running with its cache
disabled. Before enabling the cache its contents must be explicitly
invalidated, a process that requires quite a few registers that the
caller must preserve. Evidence suggests that something is writing a
block of zeroes over that space at a time when all other cores should
be idle, possibly some kind of write-combiner, and retrying is an
attempt to avoid the problem.

The previous attempted fix (forcing the accesses to only be 4-byte
aligned) appears to have only worked for a while and likely for less
obvious reasons such as a change in code alignment.

See: Hexxeh/rpi-firmware#232

This reverts commit fe4cc0e. The speculative patch that this commit reverts is proving to not be effective any more, so revert it and try a new approach. Signed-off-by: Phil Elwell <phil@raspberrypi.com>

A failure of some CPU cores to come online has been traced to the failure of a stm instruction while the cache is disabled. The symptom is that the saved values read back as zeroes, a catastrophic error since one of the values is a return address. This patch forces a readback and retry until the correct value is returned, Notes: At this stage in the boot process the core is running with its cache disabled. Before enabling the cache its contents must be explicitly invalidated, a process that requires quite a few registers that the caller must preserve. Evidence suggests that something is writing a block of zeroes over that space at a time when all other cores should be idle, possibly some kind of write-combiner, and retrying is an attempt to avoid the problem. The previous attempted fix (forcing the accesses to only be 4-byte aligned) appears to have only worked for a while and likely for less obvious reasons such as a change in code alignment. See: Hexxeh/rpi-firmware#232 Signed-off-by: Phil Elwell <phil@raspberrypi.com>

pelwell · 2021-09-20T13:35:44Z

@popcornmix has reminded me that the startup code was modified in 5.13 so as to no longer need to spill the registers to the stack. The changes are covered by this patchset (https://www.spinics.net/lists/arm-kernel/msg874939.html), comprising the following commits:

c0e5073 ARM: 9057/1: cache-v7: add missing ISB after cache level selection
f9e7a99 ARM: 9058/1: cache-v7: refactor v7_invalidate_l1 to avoid clobbering r5/r6
95731b8 ARM: 9059/1: cache-v7: get rid of mini-stack

A quick reading suggested that applying these patches on top of the reversion of my first patch would provide an alternative fix for the problem, but so far I've been unable to get it to boot.

pelwell · 2021-09-20T15:41:30Z

This patch fixes the problem for me to a reasonable level of confidence; I've seen 5+ CPU1 failures on a rev 1.1 2B running top-of-tree rpi-5.10.y (the failure mode also takes out USB) in about half an hour of reboots, and nothing in over an hour (including some cold boots) with the retry patch.

JamesH65 · 2021-09-20T15:47:20Z

The only question I had was whether the retry will always eventually succeed, or is there the vaguest possible of it infinite looping?

pelwell · 2021-09-20T15:54:17Z

Doing nothing is the lowest risk.

JamesH65 · 2021-09-20T15:55:26Z

Well, we've seen reports in the field, so anything is better than nothing.

pelwell · 2021-09-20T15:56:32Z

More importantly, since this is code that runs on each of the secondary cores, an infinite loop will only cause the symptoms people are seeing now (CPUx failed to come online) but without the horrible side-effects caused by a branch to zero...

See: raspberrypi/linux#4591

pelwell · 2021-09-21T10:29:11Z

The only question I had was whether the retry will always eventually succeed

I have a partial answer to that, having seen one failure (CPU2) out of 250 reboots - the primary core gives up waiting after a short interval and marks the secondary as offline. The difference with the patched kernel, apart from the reduced incidence, is that USB was functional after the failure.

JamesH65 · 2021-09-21T10:44:52Z

Sounds like a win to me. Can we increase the primary core timeout?

pelwell · 2021-09-21T11:38:07Z

Retesting now with a 10s timeout (was 1s).

pelwell · 2021-09-21T14:35:17Z

Just got another CPU2 failure, despite the 10s timeout - it's dead, Jim.

lurch · 2021-09-23T10:26:52Z

I'm afraid I know sod-all about ARM assembler, but just out of curiosity I googled stmia and found https://developer.arm.com/documentation/ddi0406/b/Application-Level-Architecture/Instruction-Details/Alphabetical-list-of-instructions/STM---STMIA---STMEA?lang=en which links to https://developer.arm.com/documentation/ddi0406/b/System-Level-Architecture/System-Instructions/Alphabetical-list-of-instructions/STM--user-registers-?lang=en which in turn says "Store Multiple (user registers) is unpredictable in User or System modes."
Could that explain the unpredictability you're seeing here, or is that referring to something else entirely? (I also know nothing about different CPU-modes)

KanjiMonster · 2021-10-04T16:07:14Z

@popcornmix has reminded me that the startup code was modified in 5.13 so as to no longer need to spill the registers to the stack. The changes are covered by this patchset (https://www.spinics.net/lists/arm-kernel/msg874939.html), comprising the following commits:

c0e5073 ARM: 9057/1: cache-v7: add missing ISB after cache level selection
f9e7a99 ARM: 9058/1: cache-v7: refactor v7_invalidate_l1 to avoid clobbering r5/r6
95731b8 ARM: 9059/1: cache-v7: get rid of mini-stack

A quick reading suggested that applying these patches on top of the reversion of my first patch would provide an alternative fix for the problem, but so far I've been unable to get it to boot.

For us it boots (applied on top of 5.10.70), but with this the second CPU consistently fails to come up (4 times out of 4 tries):

smp: Bringing up secondary CPUs ...
CPU1: failed to come online
smp: Brought up 1 node, 1 CPU
SMP: Total of 1 processors activated (1000.00 BogoMIPS).

95731b8 ARM: 9059/1: cache-v7: get rid of mini-stack is what seems to be breaking stuff, and not applying this one makes it come up successfully (though totally negating the intention). At a first glance don't see why though.

So maybe there are some additional changes needed.

popcornmix · 2021-10-04T16:15:53Z

Yes, 5.13/5.14/5.15 include "get rid of mini-stack" and boot, and never get the CPU startup failure.
So there definitely is something between 5.10 and 5.13 that makes it work.

The good news is the CPU startup failure will go when we bump to the next LTS kernel (but that will be some way off).

KanjiMonster · 2021-10-04T16:18:41Z

Good to know, then we'll maybe just sit this out until the next LTS kernel.

KanjiMonster · 2021-10-27T14:37:38Z

After booting a lot of kernels, I identified the missing patch series..es:

Applying commits torvalds/linux@4e79f02 to torvalds/linux@9443076 (11 commits) and torvalds/linux@67e3f82 to torvalds/linux@aaac373 (9 commits) on top of 5.10 makes the mini-stack removal actually work as expected on top of 5.10.

A few of these commits have already been backported to 5.10 stable. I did not try to create a minimal patchset that makes it work, but having both series backported definitely works at a first glance.

pelwell · 2021-10-28T11:42:05Z

Thank you very much for doing that - it's been bothering me, since even with this second patch I have seen one or two failures. As you say, some of those patches have already been back-ported, leaving 17 additional commits and a reversion of my workaround, which are all now in the rpi-5.10y as of 462a5e2.

kernel: clk: Move vec clock to clk-raspberrypi See: raspberrypi/linux#4639 kernel: ARM: v7: get rid of boot time mini stack See: raspberrypi/linux#4591 kernel: HVS Gamma LUT support for the RPi4 See: raspberrypi/linux#4435 kernel: ARM: dts: Add Pi Zero 2 support firmware: arm_loader: Allow VEC clock to be controlled by arm

popcornmix · 2021-10-28T13:35:56Z

The upstream form of fix is now in rpi-update kernel. Would be useful if anyone experiencing the issue can test and report back. Hopefully we can then get it into stable releases soon.

KanjiMonster · 2021-11-09T12:45:15Z

We added a check after every reboot in our automated testing, and haven't seen a failure in about ~1.5k reboots.

pelwell added 2 commits September 20, 2021 12:08

Revert "ARM: proc-v7: Force misalignment of early stmia"

176b8e7

This reverts commit fe4cc0e. The speculative patch that this commit reverts is proving to not be effective any more, so revert it and try a new approach. Signed-off-by: Phil Elwell <phil@raspberrypi.com>

pelwell merged commit 3e1698e into raspberrypi:rpi-5.10.y Sep 20, 2021

popcornmix added a commit to raspberrypi/firmware that referenced this pull request Sep 20, 2021

kernel: Second attempt to fix the CPU startup failure

9f97081

See: raspberrypi/linux#4591

popcornmix added a commit to Hexxeh/rpi-firmware that referenced this pull request Sep 20, 2021

kernel: Second attempt to fix the CPU startup failure

beeeb4f

See: raspberrypi/linux#4591

pelwell deleted the cpustartfix branch October 5, 2021 13:04

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Second attempt to fix the CPU startup failure #4591

Second attempt to fix the CPU startup failure #4591

pelwell commented Sep 20, 2021

pelwell commented Sep 20, 2021

pelwell commented Sep 20, 2021

JamesH65 commented Sep 20, 2021

pelwell commented Sep 20, 2021

JamesH65 commented Sep 20, 2021

pelwell commented Sep 20, 2021 •

edited

pelwell commented Sep 21, 2021

JamesH65 commented Sep 21, 2021

pelwell commented Sep 21, 2021

pelwell commented Sep 21, 2021

lurch commented Sep 23, 2021

KanjiMonster commented Oct 4, 2021

popcornmix commented Oct 4, 2021

KanjiMonster commented Oct 4, 2021

KanjiMonster commented Oct 27, 2021

pelwell commented Oct 28, 2021

popcornmix commented Oct 28, 2021

KanjiMonster commented Nov 9, 2021

Second attempt to fix the CPU startup failure #4591

Second attempt to fix the CPU startup failure #4591

Conversation

pelwell commented Sep 20, 2021

pelwell commented Sep 20, 2021

pelwell commented Sep 20, 2021

JamesH65 commented Sep 20, 2021

pelwell commented Sep 20, 2021

JamesH65 commented Sep 20, 2021

pelwell commented Sep 20, 2021 • edited

pelwell commented Sep 21, 2021

JamesH65 commented Sep 21, 2021

pelwell commented Sep 21, 2021

pelwell commented Sep 21, 2021

lurch commented Sep 23, 2021

KanjiMonster commented Oct 4, 2021

popcornmix commented Oct 4, 2021

KanjiMonster commented Oct 4, 2021

KanjiMonster commented Oct 27, 2021

pelwell commented Oct 28, 2021

popcornmix commented Oct 28, 2021

KanjiMonster commented Nov 9, 2021

pelwell commented Sep 20, 2021 •

edited