
The llsc stack is not safe from ABA with how LL/SC is used #180

Open · rustonaut opened this issue Oct 18, 2020 · 10 comments

@rustonaut commented Oct 18, 2020
On the Cortex-M, STREX will always fail if the processor takes an exception between it and its corresponding LDREX operation

But the load of next at 8000136 is done outside of the LDREX...STREX sequence.

So if the process is preempted between 8000136 and 800013a, and during the
preemption we cause the ABA problem (e.g. given h->A->B->C->0: pop A, pop B, push A, resulting in h->A->C->0 with B still in use), then the STREX can commit a stale next pointer.

Now, a given Cortex-M implementation might not have that problem:

A context switch might cause a subsequent Store-Exclusive to fail, requiring a load ... store sequence to be replayed. To minimize the possibility of this happening, ARM recommends that the Store-Exclusive instruction is kept as close as possible to the associated Load-Exclusive instruction, see Load-Exclusive and Store-Exclusive usage restrictions.

(quote from the ARMv7-M Architecture Reference Manual, linked in the source code)

The problem is that this is a "might", so depending on where you get your Cortex-M chip from, this might or might not be a sound implementation.


https://github.com/japaric/heapless/blob/9ff3a5fa8968fb263cece456ccc9505dc913147e/src/pool/mod.rs#L123-L139
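Schematically, the pop in question looks like this (a simplified sketch, not the exact heapless code):

use core::ptr::NonNull;
use core::sync::atomic::{AtomicPtr, Ordering};

struct Node { next: *mut Node }
struct Stack { head: AtomicPtr<Node> }

impl Stack {
    fn pop(&self) -> Option<NonNull<Node>> {
        loop {
            let head = NonNull::new(self.head.load(Ordering::Acquire))?;
            // this plain load runs outside (before) the LDREX...STREX pair
            // that compare_exchange_weak lowers to, so the exclusive monitor
            // cannot detect an intervening pop/pop/push (ABA) and a stale
            // `next` can be committed by the STREX
            let next = unsafe { head.as_ref().next };
            if self
                .head
                .compare_exchange_weak(head.as_ptr(), next, Ordering::AcqRel, Ordering::Acquire)
                .is_ok()
            {
                return Some(head);
            }
        }
    }
}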

@rustonaut (Author) commented Oct 18, 2020

As a side note, I don't know which of the currently sold Cortex-M based processors have that problem, nor do I know whether there are any.

So this is a specification-related bug, but it might not be a bug in practice. The problem is that it's hard to tell whether it is.

@talchas commented Sep 19, 2021

No, it's absolutely a problem in practice. The load of next is outside of the llsc region, so the "might" in the spec doesn't matter. A preemption before that (or concurrent accesses on actual SMP hardware) is not going to trigger an abort. Taking advantage of LL/SC like this is just not possible with C11-based intrinsics - you need the LL(head) part of compare_exchange_weak before the load of next. This is doable in asm, but not with the typical Rust/C/etc. intrinsics. The asm shown in the docs clearly shows why this is incorrect; for it to be ok, the ldrex/ldr.w instructions would need to be reordered (and the registers reallocated):

ldrex   r3, [r0] 
ldr.w   ip, [r2] 
cmp     r3, r1 
bne.n   800014a <<heapless::pool::Pool<T>>::pop+0x1a> 
strex   r3, ip, [r0] 

Meanwhile the warned-about x86-64 generation counter algorithm is safe in basically any realistic scenario.
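For reference, the generation counter idea can be sketched like this (illustrative only, not heapless's exact code): pack a 32-bit node index and a 32-bit tag into one AtomicU64 and bump the tag on every successful swap, so an A->B->A cycle still changes the word and the CAS fails.

use core::sync::atomic::{AtomicU64, Ordering};

struct TaggedHead {
    // low 32 bits: node index, high 32 bits: generation tag
    inner: AtomicU64,
}

impl TaggedHead {
    fn load(&self) -> (u32, u32) {
        let v = self.inner.load(Ordering::Acquire);
        (v as u32, (v >> 32) as u32) // (index, tag)
    }

    // CAS that bumps the tag on every successful update, defeating ABA
    fn try_swap(&self, old_index: u32, old_tag: u32, new_index: u32) -> bool {
        let old = u64::from(old_index) | (u64::from(old_tag) << 32);
        let new = u64::from(new_index) | (u64::from(old_tag.wrapping_add(1)) << 32);
        self.inner
            .compare_exchange_weak(old, new, Ordering::AcqRel, Ordering::Acquire)
            .is_ok()
    }
}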

@korken89 (Contributor)
@japaric Do you know how to fix this?
Or should we just trash the use of atomics and do the critical move in an interrupt-free section?
As I see it, interrupt-free is the way to go to get deterministic execution time, and the section will be extremely small.
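Something like this is what I have in mind (a rough sketch using cortex-m's interrupt::free; the names are illustrative):

use core::cell::UnsafeCell;
use core::ptr::NonNull;

struct Node { next: *mut Node }

struct Stack {
    head: UnsafeCell<*mut Node>, // only ever touched with interrupts disabled
}

impl Stack {
    fn pop(&self) -> Option<NonNull<Node>> {
        // the whole read-modify-write becomes one tiny critical section
        cortex_m::interrupt::free(|_| unsafe {
            let head = NonNull::new(*self.head.get())?;
            *self.head.get() = head.as_ref().next;
            Some(head)
        })
    }
}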

@japaric (Member) commented Sep 20, 2021

yes, I have written this stack using ldrex and strex "intrinsics" (plain functions around inline assembly) before. I think that was more correct (and shorter) than what one gets with compare_exchange. It required a nightly toolchain because of the inline assembly, and I no longer have that code around.
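such wrappers would look something like this (a sketch, not the original code; thumbv7 targets only):

use core::arch::asm;

#[inline(always)]
unsafe fn ldrex(addr: *const u32) -> u32 {
    let value: u32;
    // LL: load and set the exclusive monitor for `addr`
    asm!("ldrex {}, [{}]", out(reg) value, in(reg) addr, options(nostack, readonly));
    value
}

#[inline(always)]
unsafe fn strex(value: u32, addr: *mut u32) -> u32 {
    let status: u32;
    // SC: status is 0 if the store succeeded, 1 if exclusivity was lost
    asm!("strex {}, {}, [{}]", out(reg) status, in(reg) value, in(reg) addr, options(nostack));
    status
}

#[inline(always)]
unsafe fn clrex() {
    // clear any outstanding exclusive reservation
    asm!("clrex", options(nostack, nomem));
}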

Or should we just trash the use of atomics and do the critical move in an interrupt-free section?

not a fan of the interrupt::free suggestion. if we are going to make the code device specific, I would look into writing the push and pop subroutines in "external assembly", which works on stable (no time for any of that this week though)

@korken89 (Contributor)
Hmm, what are you not a fan of? It will be a few instructions long (4?) with guaranteed execution time.
So it's more efficient than the current ldrex and strex code, as the current loops and checks will go away since the operation is guaranteed to work.
To keep testability on x86 we can have an atomics-based spin-wait (same as it is now).

I'm not seeing the problem you are seeing.

@japaric (Member) commented Sep 20, 2021

the documentation states this is a lock-free queue. adding a critical section breaks that guarantee and would be a breaking change in my opinion.

spsc::Queue – single producer single consumer lock-free queue

@korken89 (Contributor)
I think we disagree then; as long as you do not need mutable references it is lock-free, as the pool works via immutable references.
And this API will be kept, so from my view it's not a breaking change.

I'd also argue that using atomics here is over-engineered, as it will be an extremely small critical section which in the worst case adds a few clock cycles of predictable latency to the system.
And since the fix seems to entail custom assembly, I'd argue even harder that atomics are not the correct thing to use here, given that we want future maintainability and easy understanding.
This means we can keep atomics and have an unbounded worst case and an advanced implementation, or use critical sections and have a strictly bounded worst case (which is smaller than the best case with atomics) and very easy-to-understand code.

@japaric (Member) commented Sep 20, 2021

I think we disagree then; as long as you do not need mutable references it is lock-free, as the pool works via immutable references. And this API will be kept, so from my view it's not a breaking change.

Mutation via shared references (&-references) can be called 'interior mutability' and/or 'thread-safe' (= implements Sync). That's a different property than 'lock-free', which means free from 'critical sections', whether those are achieved by disabling interrupts or by using some flavor of Mutex.

To be clear, I'm not opposed to adding a general purpose fixed-capacity queue that works with mutable references but that's a different data structure. People can layer device-specific synchronization on top of that queue to get what you are describing, e.g.

// crate: heapless
pub struct GeneralPurposeQueue<T> { /* .. */ }

impl<T> GeneralPurposeQueue<T> {
    pub fn enqueue(&mut self, item: T) -> Result<(), T> { /* .. */ }
    pub fn dequeue(&mut self) -> Option<T> { /* .. */ }
}

// crate: cortex-m-queue
use core::cell::RefCell;
use cortex_m::interrupt;

pub struct CortexMQueue<T>(interrupt::Mutex<RefCell<GeneralPurposeQueue<T>>>);

impl<T> CortexMQueue<T> {
    pub fn enqueue(&self, item: T) -> Result<(), T> {
        interrupt::free(|cs| self.0.borrow(cs).borrow_mut().enqueue(item))
    }
}

I'd also argue that using atomics here is over-engineered

Bounded (fixed-capacity) lock-free SPSC queues based on atomics are a well-known data structure dating from around 1980 (I think the original work is by Lamport). There are plenty of variants (bounded, unbounded, cache-friendly, etc.) by now but I would not call any of them "over-engineered"; they all have different trade-offs.
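for reference, a minimal Lamport-style bounded SPSC queue can be sketched like this (illustrative only; not heapless's spsc::Queue). one slot is kept empty to tell "full" from "empty", and each index is written by exactly one side:

use core::cell::UnsafeCell;
use core::mem::MaybeUninit;
use core::sync::atomic::{AtomicUsize, Ordering};

pub struct Spsc<T, const N: usize> {
    buf: [UnsafeCell<MaybeUninit<T>>; N],
    head: AtomicUsize, // written only by the consumer
    tail: AtomicUsize, // written only by the producer
}

unsafe impl<T: Send, const N: usize> Sync for Spsc<T, N> {}

impl<T, const N: usize> Spsc<T, N> {
    pub const fn new() -> Self {
        Self {
            buf: [const { UnsafeCell::new(MaybeUninit::uninit()) }; N],
            head: AtomicUsize::new(0),
            tail: AtomicUsize::new(0),
        }
    }

    // producer side: one load, one store, one publish; no locks
    pub fn enqueue(&self, item: T) -> Result<(), T> {
        let tail = self.tail.load(Ordering::Relaxed);
        let next = (tail + 1) % N;
        if next == self.head.load(Ordering::Acquire) {
            return Err(item); // full
        }
        unsafe { (*self.buf[tail].get()).write(item) };
        self.tail.store(next, Ordering::Release); // publish the slot
        Ok(())
    }

    // consumer side
    pub fn dequeue(&self) -> Option<T> {
        let head = self.head.load(Ordering::Relaxed);
        if head == self.tail.load(Ordering::Acquire) {
            return None; // empty
        }
        let item = unsafe { (*self.buf[head].get()).assume_init_read() };
        self.head.store((head + 1) % N, Ordering::Release);
        Some(item)
    }
}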


The other problem I see with a critical-section-based queue that has no atomics is that it's not multi-core safe, whereas the lock-free queue here is.

I see that your main line of argument is "bounded execution time". I can see the use case for that and I'm fully OK with a queue that has such a property, but like I said above that's a different data structure, so we should not be modifying the existing spsc::Queue but rather adding a new queue that has that property.

@korken89 (Contributor)
I think we are discussing 2 different things, spsc::Queue is fine as is. pool and its internal Treiber stack is the one with the issue. :)

@japaric (Member) commented May 2, 2022

now that asm! is stable the LL/SC version can be rewritten to use asm!. the optimized generated assembly (target=Cortex-M4F) for the Treiber stack is

; fn push(r0: &Self, r1: NonNull<Node>)
00000000 <treiber_stack::Stack::push>:
   0:   /-> e850 2f00   ldrex   r2, [r0]
   4:   |   600a        str     r2, [r1, #0]
   6:   |   e840 1200   strex   r2, r1, [r0]
   a:   |   2a00        cmp     r2, #0
   c:   \-- d1f8        bne.n   0 <treiber_stack::Stack::push>
   e:       4770        bx      lr

; fn pop(r0: &Self) -> Option<NonNull<Node>>
00000000 <treiber_stack::Stack::pop>:
   0:          4601             mov     r1, r0
   2:   /----> e851 0f00        ldrex   r0, [r1]
   6:   |  /-- b130             cbz     r0, 16 <treiber_stack::Stack::pop+0x16>
   8:   |  |   6802             ldr     r2, [r0, #0]
   a:   |  |   e841 2300        strex   r3, r2, [r1]
   e:   |  |   2b00             cmp     r3, #0
  10:   |  |   bf08             it      eq
  12:   |  |   4770             bxeq    lr
  14:   \--|-- e7f5             b.n     2 <treiber_stack::Stack::pop+0x2>
  16:      \-> 2000             movs    r0, #0
  18:          f3bf 8f2f        clrex
  1c:          4770             bx      lr
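in Rust, pop can then be written roughly like this (a sketch built on ldrex/strex/clrex wrappers like the ones sketched in an earlier comment; thumbv7, 32-bit pointers; the actual PR may differ):

use core::cell::UnsafeCell;
use core::ptr::NonNull;

pub struct Node { next: *mut Node }

pub struct Stack {
    // only ever accessed through LDREX/STREX
    head: UnsafeCell<*mut Node>,
}

impl Stack {
    pub fn pop(&self) -> Option<NonNull<Node>> {
        unsafe {
            loop {
                // LL first: the exclusive monitor now covers the next-pointer load
                let head = ldrex(self.head.get() as *const u32) as *mut Node;
                let head = match NonNull::new(head) {
                    Some(head) => head,
                    None => {
                        clrex(); // drop the reservation before bailing out
                        return None;
                    }
                };
                let next = head.as_ref().next;
                // SC: fails if anything (exception, competing LL/SC) intervened,
                // so a stale `next` is never committed
                if strex(next as u32, self.head.get() as *mut u32) == 0 {
                    return Some(head);
                }
            }
        }
    }
}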

I'll prepare a PR

japaric added a commit that referenced this issue May 3, 2022
this changes the implementation of the underlying Treiber stack to use LDREX and STREX on
ARMv7-A/R/M. other architectures are left unchanged in this PR. in principle, RISC-V could be
implemented in a similar fashion but I haven't tested that yet

this does bump the MSRV so it's technically a breaking change ...

to make this easier to maintain I'd like to drop llsc.rs and make pool only available on targets
where it implements Sync *and* is sound, but that's an API-breaking change. there are other
API-breaking changes I'd like to make (e.g. remove the Uninit type state) so I'll write an RFC first

fixes #180