Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

arch: riscv: fix hangup of multicore boot #65000

Merged

Conversation

katsuster
Copy link
Member

This patch fixes hangup of RISC-V multicore boot.
Currently boot sequence uses a riscv_cpu_wake_flag to notify wakeup request for secondary core(s).

But initial value of riscv_cpu_wake_flag is undefined, so current mechanism is going to hangup if riscv_cpu_wake_flag and mhartid of secondary core have the same value.

This is an example situation of this problem:

  • hart1: check riscv_cpu_wake_flag (value is 1) and end the loop
  • hart1: set riscv_cpu_wake_flag to 0
  • hart0: set riscv_cpu_wake_flag to 1 hart0 expects it will be changed to 0 by hart1 but it has never happened

Note:

  • hart0's mhartid is 0, hart1's mhartid is 1
  • hart0 is main, hart1 is secondary in this example

This patch fixes hangup of RISC-V multicore boot.
Currently boot sequence uses a riscv_cpu_wake_flag to notify wakeup
request for secondary core(s).

But initial value of riscv_cpu_wake_flag is undefined, so current
mechanism is going to hangup if riscv_cpu_wake_flag and mhartid of
secondary core have the same value.

This is an example situation of this problem:

- hart1: check riscv_cpu_wake_flag (value is 1) and end the loop
- hart1: set riscv_cpu_wake_flag to 0
- hart0: set riscv_cpu_wake_flag to 1
         hart0 expects it will be changed to 0 by hart1 but it
         has never happened

Note:
  - hart0's mhartid is 0, hart1's mhartid is 1
  - hart0 is main, hart1 is secondary in this example

Signed-off-by: Katsuhiro Suzuki <katsuhiro@katsuster.net>
Copy link
Member

@fkokosinski fkokosinski left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM

@carlescufi carlescufi merged commit 4ce5f7e into zephyrproject-rtos:main Nov 10, 2023
16 checks passed
@amr-sc
Copy link
Contributor

amr-sc commented Apr 8, 2024

Hi @katsuster ,

I'm sorry for such a late comment on your change.
I was away from Zephyr for some time but currently I'm back and inprogress with migration of our RISC-V codebase to recent Zephyr 3.6 release and stumbled across this fix.

Could you please clarify:

  1. under "riscv_cpu_wake_flag is undefined" do you mean the race condition between the Secondary core access to riscv_cpu_wake_flag and the Primary core zeroing-out of the .bss section where riscv_cpu_wake_flag is placed?
  2. If (1) is the case, then what do you think about minor optimization with explicit riscv_cpu_wake_flag initialization to -1 on declaration, this way we wouldn't introduce the extra sync variable and will eliminate the race condition described above?

@katsuster
Copy link
Member Author

Hello @amr-sc,

Thanks your comment. I think there is same problem even if riscv_cpu_wake_flag has initial value -1.
This race condition happens before initializing BSS or copying DATA from ROM area to RAM.

If someone reset the CPUs after overwriting riscv_cpu_wake_flag area (and unfortunately the value of it is same as one of mhardid), the hart cannot wait notification via wake_flag from main CPU and going to hangup.

@amr-sc
Copy link
Contributor

amr-sc commented Apr 12, 2024

Hello @amr-sc,

Thanks your comment. I think there is same problem even if riscv_cpu_wake_flag has initial value -1. This race condition happens before initializing BSS or copying DATA from ROM area to RAM.

If someone reset the CPUs after overwriting riscv_cpu_wake_flag area (and unfortunately the value of it is same as one of mhardid), the hart cannot wait notification via wake_flag from main CPU and going to hangup.

Hello @katsuster , thank you very much for your fast reply!

The issue you are highlighting here is very interesting. I was thinking that DATA section is always initialized before we start to execute the Zephyr code, since it is getting copied together with the TEXT to DDR. And the only race condition (on my system it is exactly the case) is possible when we initialize BSS from Primary core. So I'm a bit confused.

Could you please clarify the target scenario where we can have a DATA section race condition?
If I understand you correctly it is some kind of SW reset, where we should re-initialize the DATA section?

P.S.: to me it seems like current Zephyr RISC-V implementation is not ready for such scenario, e.g. we don't even have a code to copy DATA section for a single core..

@katsuster
Copy link
Member Author

Hello @arm-sc,

DATA section is copied and initialized by other boot loader before Zephyr running. If in this mode, using DATA section to set initial value to boot_flag is simple and nice solution.

But Zephyr also supports execution in place (XIP) mode by CONFIG_XIP = y. In this mode DATA section is stored in ROM and it is readonly. The kernel copy data from ROM to RAM in boot time.

RISC-V case:
__reset -> __initialize -> z_prep_c() -> z_data_copy()

If you have interest, please check CONFIG_XIP and z_data_copy() function at kernel/xip.c.

@amr-sc
Copy link
Contributor

amr-sc commented Apr 17, 2024

Hi @katsuster,

I didn't know about that functionality. Thanks a lot for flashing a light on it!

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
area: RISCV RISCV Architecture (32-bit & 64-bit)
Projects
None yet
Development

Successfully merging this pull request may close these issues.

None yet

5 participants