Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

use of __noinit with ecc memory hangs system #39598

Closed
cfriedt opened this issue Oct 20, 2021 · 7 comments
Closed

use of __noinit with ecc memory hangs system #39598

cfriedt opened this issue Oct 20, 2021 · 7 comments
Assignees
Labels
bug The issue is a bug, or the PR is fixing a bug priority: low Low impact/importance bug Stale

Comments

@cfriedt
Copy link
Member

cfriedt commented Oct 20, 2021

Describe the bug
The __noinit attribute is used primarily for the purposes listed below:

  1. to preserve a section of SRAM across reboots
  2. to speedup init time by reducing the size of .bss

Some configurations may also use __noinit to skip initialization of memory regions initialized by other application-specific hardware.

There have been two identified areas where memory regions marked as __noinit are read-from before they are written-to - specifically, the shell and logging subsystems. Issuing a read before a write might be reasonable in certain scenarios, but for some systems with ECC-enabled memory controllers, a read operation before a write operation will cause the memory controller to issue a fault, hanging the system.

To Reproduce

  1. obtain a system with ECC memory and an ECC-capable memory controller
  2. enabling logging (with deferred mode) and / or the shell
  3. log messages
  4. observe hang

In the above cases, the system is completely hung before the boot banner was even printed.

Expected behavior
The system is not hung.

Impact
Showstopper

Logs and console output
There are no logs, nor is there any output on the console. It may be possible to find a stacktrace.

Environment (please complete the following information):

  • OS: Zephyr v2.6.0
  • Toolchain: crosstool-ng / riscv64
  • Commit SHA or Version used: 79a6c07

Additional context
This problem cannot be reproduced on non-ECC systems, and specifically, the ECC memory controller must also be configured to check SRAM memory regions. It is not obvious if there are any readily-available Zephyr community boards that include an ECC memory controller. This may highlight previously unknown race conditions, but those fixes may be considerably more involved. Therefore, the linked PR is mainly for mitigation purposes.

@cfriedt cfriedt added the bug The issue is a bug, or the PR is fixing a bug label Oct 20, 2021
@cfriedt cfriedt self-assigned this Oct 20, 2021
cfriedt added a commit to cfriedt/zephyr that referenced this issue Oct 21, 2021
The `__noinit` option is useful for
1. preserving sram across reboots
2. reducing the size of `.bss` and speeding-up init

However, some systems (such as those with ECC memory) can
fault / hang when reading from previously unwritten memory.

Default `CONFIG_NOINIT` to `y` to preserve existing behaviour.

Fixes zephyrproject-rtos#39598

Signed-off-by: Christopher Friedt <chrisfriedt@gmail.com>
@cfriedt cfriedt added the dev-review To be discussed in dev-review meeting label Oct 21, 2021
@evgeniy-paltsev
Copy link
Collaborator

Hi @cfriedt

There have been two identified areas where memory regions marked as __noinit are read-from before they are written-to - specifically, the shell and logging subsystems.

Do you know exact places where it happens? I can't imagine the reason for shell / logging for reading from __noinit before writing to it.

@cfriedt
Copy link
Member Author

cfriedt commented Oct 21, 2021

Hi @cfriedt

There have been two identified areas where memory regions marked as __noinit are read-from before they are written-to - specifically, the shell and logging subsystems.

Do you know exact places where it happens? I can't imagine the reason for shell / logging for reading from __noinit before writing to it.

I can try to find some details (e.g. stack traces) but it may be confidential. Will try my best.

Alternatively, it may be possible to try and trigger faults using MMU / MPU regions.

@cfriedt cfriedt added the has-pr label Oct 26, 2021
@galak
Copy link
Collaborator

galak commented Oct 26, 2021

Look at doing something similar to z_bss_zero to just init all of memory for these kinda of systems. Changing __noinit doesn't feel like the right solution.

@cfriedt cfriedt added priority: low Low impact/importance bug and removed dev-review To be discussed in dev-review meeting labels Oct 26, 2021
@cfriedt cfriedt removed the has-pr label Nov 3, 2021
@github-actions
Copy link

github-actions bot commented Jan 3, 2022

This issue has been marked as stale because it has been open (more than) 60 days with no activity. Remove the stale label or add a comment saying that you would like to have the label removed otherwise this issue will automatically be closed in 14 days. Note, that you can always re-open a closed issue at any time.

@github-actions github-actions bot added the Stale label Jan 3, 2022
@cfriedt cfriedt reopened this Jul 19, 2022
@cfriedt cfriedt removed the Stale label Jul 19, 2022
cfriedt added a commit to cfriedt/zephyr that referenced this issue Jul 22, 2022
Add `CONFIG_ZERO_NOINIT_MEMORY` to explicitly zero
memory regions marked `__noinit` at startup.

This is useful for systems with ECC memory.

Fixes zephyrproject-rtos#39598

Signed-off-by: Christopher Friedt <cfriedt@fb.com>
@github-actions
Copy link

This issue has been marked as stale because it has been open (more than) 60 days with no activity. Remove the stale label or add a comment saying that you would like to have the label removed otherwise this issue will automatically be closed in 14 days. Note, that you can always re-open a closed issue at any time.

@github-actions github-actions bot added the Stale label Sep 18, 2022
@cfriedt cfriedt removed the Stale label Sep 18, 2022
@cfriedt
Copy link
Member Author

cfriedt commented Sep 18, 2022

The solution suggested in #48172 is that the solution should be part of a specific memory controller driver implementation.

@github-actions
Copy link

This issue has been marked as stale because it has been open (more than) 60 days with no activity. Remove the stale label or add a comment saying that you would like to have the label removed otherwise this issue will automatically be closed in 14 days. Note, that you can always re-open a closed issue at any time.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
bug The issue is a bug, or the PR is fixing a bug priority: low Low impact/importance bug Stale
Projects
None yet
3 participants