
Problem clearing software breakpoints with other halted hart on smp #893

Open
wxjstz opened this issue Jul 20, 2023 · 25 comments

Comments

@wxjstz
Contributor

wxjstz commented Jul 20, 2023

This patch (39a4f37) introduces a problem. After clearing a software breakpoint, fence.i needs to run to refresh the instruction cache, and that instruction can only be run by the hart on which the breakpoint was set.

@wxjstz
Contributor Author

wxjstz commented Jul 20, 2023

@eosea @timsifive

@TommyMurphyTM1234
Collaborator

TommyMurphyTM1234 commented Jul 20, 2023

But Zifencei is an optional extension, so OpenOCD can't assume that the target supports the fence.i instruction, can it?

@timsifive
Collaborator

Can you elaborate a bit more on what exactly the problem case is?

OpenOCD already tries to execute fence.i whenever a hart is resumed, so I don't see how you can end up in a situation where the "wrong" hart clearing a software breakpoint causes a problem.

@TommyMurphyTM1234, you are right, and OpenOCD has always just executed fence.i anyway. I imagine we'll fix that when someone is motivated or it breaks on hardware.

@TommyMurphyTM1234
Collaborator

TommyMurphyTM1234 commented Jul 20, 2023

@TommyMurphyTM1234, you are right, and OpenOCD has always just executed fence.i anyway. I imagine we'll fix that when someone is motivated or it breaks on hardware.

In practice that's probably OK until (if it ever happens) someone has an implementation that doesn't support Zifencei/fence.i. I may have been mixing this up with the confusion that arises when people build the toolchain and forget to build multilibs for rvXX..._zifencei, ever since Zifencei (and Zicsr) was split out from the base integer ISA. 🙂

@wxjstz
Contributor Author

wxjstz commented Jul 21, 2023

RV32G/RV64G are general-purpose platforms, and G includes IMAFD_Zicsr_Zifencei, so general-purpose platforms support the Zifencei extension. We may need to test whether the target platform supports Zifencei, or add an option for it. But the patch that clears software breakpoints through an available core doesn't just affect RISC-V; I believe other platforms have cache-synchronization issues as well.

@wxjstz
Contributor Author

wxjstz commented Jul 21, 2023

Although Zifencei is an optional extension, it is commonly implemented; only some simple embedded platforms have no need for it.

@TommyMurphyTM1234
Collaborator

TommyMurphyTM1234 commented Jul 21, 2023

Apologies for going off topic on the Zifencei/fence.i issue.
I was not aware earlier that OpenOCD already uses (depends on?) this extension/instruction.
The wider issue is what extensions/instructions OpenOCD depends on the target supporting.
(In my opinion this should ideally be the base I/E integer ISAs with graceful degradation in service where other "desirable" extensions are not available).
And this issue may have arisen already in other contexts - including the wider context of the ongoing RISC-V debug specification.
Either way, this is a separate issue from the one in hand so please ignore it here from now on. :-)

Edit: I see that the issue was previously mentioned here:

And there's a potentially related but wider in scope discussion here:

@wxjstz
Contributor Author

wxjstz commented Jul 24, 2023

What I mean is: this patch introduces bugs and needs to be removed. Whether the Zifencei extension is supported is another question.

@TommyMurphyTM1234
Collaborator

What I mean is: this patch introduces bugs and needs to be removed. Whether the Zifencei extension is supported is another question.

But you haven't addressed @timsifive's question here:

@wxjstz
Contributor Author

wxjstz commented Jul 28, 2023

Can you elaborate a bit more on what exactly the problem case is?

OpenOCD already tries to execute fence.i whenever a hart is resumed, so I don't see how you can end up in a situation where the "wrong" hart clearing a software breakpoint causes a problem.

Assume that the software breakpoint of hart0 is cleared through hart1 while hart0 is running: hart1 modifies the memory, but the modification may not be synchronized to hart0's instruction cache. hart0 itself needs to run fence.i.

@TommyMurphyTM1234, you are right, and OpenOCD has always just executed fence.i anyway. I imagine we'll fix that when someone is motivated or it breaks on hardware.

@timsifive
Collaborator

Assume that the software breakpoint of hart0 is cleared through hart1 while hart0 is running: hart1 modifies the memory, but the modification may not be synchronized to hart0's instruction cache. hart0 itself needs to run fence.i.

That will only happen in unusual circumstances. We're looking for harts within an SMP group, and OpenOCD tries to keep all harts in a single SMP group in the same state. If it's not successful, then at best gdb is going to get confused. Having one hart in an SMP group running while another is halted is not, I think, a supported use case. (It could happen because the debugger doesn't have complete control over a target, but if things change spontaneously on the target there are all kinds of possible problems.)

@wxjstz
Contributor Author

wxjstz commented Jul 29, 2023

Assume that the software breakpoint of hart0 is cleared through hart1 while hart0 is running: hart1 modifies the memory, but the modification may not be synchronized to hart0's instruction cache. hart0 itself needs to run fence.i.

That will only happen in unusual circumstances. We're looking for harts within an SMP group, and OpenOCD tries to keep all harts in a single SMP group in the same state. If it's not successful, then at best gdb is going to get confused. Having one hart in an SMP group running while another is halted is not, I think, a supported use case. (It could happen because the debugger doesn't have complete control over a target, but if things change spontaneously on the target there are all kinds of possible problems.)

39a4f37#diff-926a25c5eb17bac57cb7ba08fb85a02d9ffe1df033fd3336368b906ea3672805R373-R379
This modification attempts to use a halted hart to clear a running hart's software breakpoint.

@timsifive
Collaborator

Yes, and the only way I'm aware of that code actually being executed is if the running hart became running/unavailable without the debugger requesting it. What do you think should happen in such a case?

@wxjstz
Contributor Author

wxjstz commented Aug 1, 2023

+--------------+      +----------------+
| running hart |      | available hart |
+--------------+      +----------------+
        ↑fetch                ↓write (clean software breakpoint)
   +---------+             +---------+
   | I-Cache |             | D-Cache |
   +---------+             +---------+
        ↕                      ↕
        .                       .
        .                       .
        .                       .
+-----------------------------------------+
|               memory                    |
+-----------------------------------------+

The FENCE.I instruction provides explicit synchronization between writes to instruction memory and instruction fetches on the same hart.

Therefore, using an available hart instead of the running hart to clear software breakpoints has a cache-synchronization problem: the available hart may only write the instruction into its own D-Cache, while the running hart's I-Cache is not updated.

@timsifive
Collaborator

  1. That situation doesn't occur during normal operation, because OpenOCD will ensure that harts in the same SMP group are all halted or all running.
  2. If it does happen, what should OpenOCD do?

@wxjstz
Contributor Author

wxjstz commented Aug 1, 2023

The I-Cache, unlike the D-Cache, does not require frequent synchronization with the next level of memory; it only loads from memory when a fetch misses the cache.

When this happens there are two possible results:

  1. if the address where the software breakpoint is located is cached by the I-Cache, then clearing the software breakpoint will fail (the stale ebreak is still fetched);
  2. if the address where the software breakpoint is located is not cached by the I-Cache, then clearing the software breakpoint can succeed.

@aap-sc
Collaborator

aap-sc commented Aug 1, 2023

Could it be the case that such HW configurations should not use an SMP configuration at all? I believe such configurations have difficulties even with setting SW breakpoints. The problem is the same: we set the SW breakpoint from only one hart.

@timsifive
Collaborator

@aap-sc, when setting SW breakpoints, generally all harts are halted. OpenOCD writes the breakpoint through one hart and then has all of them execute some fence instructions, which should let the caches synchronize.

@aap-sc
Collaborator

aap-sc commented Aug 1, 2023

execute some fence instructions which should let the caches synchronize.

But does execution of a fence instruction guarantee that the caches are always synchronized? If I understand correctly, such synchronization is only guaranteed if some load from, say, hart1 actually observes a store from hart0, which may never happen (meaning hart1 could read a previous value even if hart0 issued a fence). Or is my understanding incorrect?

@wxjstz
Contributor Author

wxjstz commented Aug 2, 2023

@aap-sc
Collaborator

aap-sc commented Aug 18, 2023

Each hart must perform write and fence.i

  1. So if each hart must perform a write and a fence, then setting SW breakpoints may not work in the current implementation, since AFAIK we set the SW breakpoint only once, from the active core.

  2. Moreover, this implies that functions like load_image would need to write the same program for each hart separately, to be sure that memory is synchronized across cores. @timsifive, do you know if there is any chance the debug spec allows some relaxation regarding this?

@TommyMurphyTM1234
Collaborator

@timsifive
Collaborator

RISC-V and caches are still a bit behind feature completeness. This is in part because for a long time the official position was that caches are an implementation detail not visible to the ISA. There are more extensions focused on cache behavior these days. In any case, the problem of downloading a program through a debugger and executing a "process" on a different hart is identical, so the debugger should be able to do whatever software does, because it can execute all the instructions that software can. This is a long way of saying I don't think there will be any changes to the debug spec to handle caching using the program buffer.

@aap-sc
Collaborator

aap-sc commented Aug 31, 2023

@TommyMurphyTM1234

Is this article of any relevance in this context?

Thanks for the article. However, it focuses mostly on cache-maintenance operations issued by the same core. I guess the current implementation in OpenOCD is fine if the hart supports fence.i. My understanding is that the issue discussed here is mostly about maintaining some kind of I-cache coherence when we have several cores.

This is a long way of saying I don't think there will be any changes to the debug spec to handle caching using the program buffer.

Got it, thanks.

Sorry for the ratholing.

However, as per @wxjstz comment:

Each hart must perform write and fence.i

This means that the current implementation is not quite correct. It looks like when a SW breakpoint is set or cleared, the memory should be updated by all harts. This is not what OpenOCD does currently. Should this be fixed?

@aap-sc
Collaborator

aap-sc commented Aug 31, 2023

This means that the current implementation is not quite correct. It looks like when a SW breakpoint is set or cleared, the memory should be updated by all harts. This is not what OpenOCD does currently. Should this be fixed?

Never mind, my bad; it looks like I'm wrong. Sorry for the potential confusion. If we set a breakpoint, the operation is performed only by the active hart, and that is by design. So the potential issue exists only for the case where we clear the breakpoint while the original hart is unavailable.
