Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

support arrays of any size, don't require an initialization value, .. #4

Merged
merged 5 commits into from
Oct 31, 2017

Conversation

japaric
Copy link
Member

@japaric japaric commented Oct 3, 2017

single producer single consumer support for ring buffer

single producer single consumer support for ring buffer
@pftbest
Copy link

pftbest commented Oct 3, 2017

Sorry for interrupting, but I'm curious, why did you decide to use usize for head and tail instead of atomics. Is it for making it work on armv6 which doesn't have atomics? If I understand correctly, volatile operations are only safe on single core CPUs. On multi core CPUs you need some kind of memory fence to share the data in buffer between cores. Best option is release-acquire synchronization using atomics.

@pftbest
Copy link

pftbest commented Oct 3, 2017

@japaric Not so long ago I wrote a similar spsc ring buffer and tested it using thread sanitizer

https://play.rust-lang.org/?gist=4ebd95ec79757a7d2b625cb85d58300c&version=stable

@japaric
Copy link
Member Author

japaric commented Oct 31, 2017

@pftbest yes and yes. I'm also mostly interested in using this crate with single core processors (microcontrollers) but it's a good idea to note that this won't work on multicore processors. We can revisit the atomic implementation if / when rust-lang/rust#45085 otherwise it's tricky to write an implementation for ARMv6-M.

@japaric
Copy link
Member Author

japaric commented Oct 31, 2017

@homunkulus r+

@homunkulus
Copy link
Contributor

📌 Commit da5757a has been approved by japaric

@homunkulus
Copy link
Contributor

⌛ Testing commit da5757a with merge d7939bb...

japaric pushed a commit that referenced this pull request Oct 31, 2017
support arrays of any size, don't require an initialization value, ..

single producer single consumer support for ring buffer
@homunkulus
Copy link
Contributor

💔 Test failed - status-travis

@japaric
Copy link
Member Author

japaric commented Oct 31, 2017

@homunkulus retry

@homunkulus
Copy link
Contributor

⌛ Testing commit da5757a with merge 10c5542...

japaric pushed a commit that referenced this pull request Oct 31, 2017
support arrays of any size, don't require an initialization value, ..

single producer single consumer support for ring buffer
@homunkulus
Copy link
Contributor

☀️ Test successful - status-travis
Approved by: japaric
Pushing 10c5542 to master...

@homunkulus homunkulus merged commit da5757a into master Oct 31, 2017
@japaric japaric deleted the v2 branch October 31, 2017 18:02
japaric added a commit that referenced this pull request May 10, 2022
these changes optimize `Vec<u8, 1024>::clone` down to these operations

1. reserve the stack space (1028 bytes on 32-bit ARM) and leave it uninitialized
2. zero the `len` field
3. memcpy `len` bytes of data from the parent

analyzed source code

``` rust
use heapless::Vec;

fn clone(vec: &Vec<u8, 1024>) {
    let mut vec = vec.clone();
    black_box(&mut vec);
}

fn black_box<T>(val: &mut T) {
    unsafe { asm!("// {0}", in(reg) val) }
}
```

machine code with `lto = fat`, `codegen-units = 1` and `opt-level = 'z'` ('z' instead of 3 to avoid loop unrolling and keep the machine code readable)

``` armasm
00020100 <clone>:
   20100:              b5d0             push    {r4, r6, r7, lr}
   20102:              af02             add     r7, sp, #8
   20104:              f5ad 6d81        sub.w   sp, sp, #1032   ; 0x408
   20108:              2300             movs    r3, #0
   2010a:              c802             ldmia   r0!, {r1}
   2010c:              9301             str     r3, [sp, #4]
   2010e:              aa01             add     r2, sp, #4
   20110:       /--/-X b141             cbz     r1, 20124 <clone+0x24>
   20112:       |  |   4413             add     r3, r2
   20114:       |  |   f810 4b01        ldrb.w  r4, [r0], #1
   20118:       |  |   3901             subs    r1, #1
   2011a:       |  |   711c             strb    r4, [r3, #4]
   2011c:       |  |   9b01             ldr     r3, [sp, #4]
   2011e:       |  |   3301             adds    r3, #1
   20120:       |  |   9301             str     r3, [sp, #4]
   20122:       |  \-- e7f5             b.n     20110 <clone+0x10>
   20124:       \----> a801             add     r0, sp, #4
   20126:              f50d 6d81        add.w   sp, sp, #1032   ; 0x408
   2012a:              bdd0             pop     {r4, r6, r7, pc}
```

note that it's not optimizing step (3) to an actual `memcpy` because we lack the 'trait specialization' code that libstd uses

---

before `clone` was optimized to

1. reserve and zero (`memclr`) 1028 (!?) bytes of stack space
2. (unnecessarily) runtime check if `len` is equal or less than 1024 (capacity) -- this included a panicking branch
3. memcpy `len` bytes of data from the parent
japaric added a commit that referenced this pull request May 10, 2022
these changes optimize `Vec<u8, 1024>::clone` down to these operations

1. reserve the stack space (1028 bytes on 32-bit ARM) and leave it uninitialized
2. zero the `len` field
3. memcpy `len` bytes of data from the parent

analyzed source code

``` rust
use heapless::Vec;

fn clone(vec: &Vec<u8, 1024>) {
    let mut vec = vec.clone();
    black_box(&mut vec);
}

fn black_box<T>(val: &mut T) {
    unsafe { asm!("// {0}", in(reg) val) }
}
```

machine code with `lto = fat`, `codegen-units = 1` and `opt-level = 'z'` ('z' instead of 3 to avoid loop unrolling and keep the machine code readable)

``` armasm
00020100 <clone>:
   20100:              b5d0             push    {r4, r6, r7, lr}
   20102:              af02             add     r7, sp, #8
   20104:              f5ad 6d81        sub.w   sp, sp, #1032   ; 0x408
   20108:              2300             movs    r3, #0
   2010a:              c802             ldmia   r0!, {r1}
   2010c:              9301             str     r3, [sp, #4]
   2010e:              aa01             add     r2, sp, #4
   20110:       /--/-X b141             cbz     r1, 20124 <clone+0x24>
   20112:       |  |   4413             add     r3, r2
   20114:       |  |   f810 4b01        ldrb.w  r4, [r0], #1
   20118:       |  |   3901             subs    r1, #1
   2011a:       |  |   711c             strb    r4, [r3, #4]
   2011c:       |  |   9b01             ldr     r3, [sp, #4]
   2011e:       |  |   3301             adds    r3, #1
   20120:       |  |   9301             str     r3, [sp, #4]
   20122:       |  \-- e7f5             b.n     20110 <clone+0x10>
   20124:       \----> a801             add     r0, sp, #4
   20126:              f50d 6d81        add.w   sp, sp, #1032   ; 0x408
   2012a:              bdd0             pop     {r4, r6, r7, pc}
```

note that it's not optimizing step (3) to an actual `memcpy` because we lack the 'trait specialization' code that libstd uses

---

before `clone` was optimized to

1. reserve and zero (`memclr`) 1028 (!?) bytes of stack space
2. (unnecessarily) runtime check if `len` is equal or less than 1024 (capacity) -- this included a panicking branch
3. memcpy `len` bytes of data from the parent
bors bot added a commit that referenced this pull request May 10, 2022
290: optimize the codegen of Vec::clone r=japaric a=japaric

these changes optimize `Vec<u8, 1024>::clone` down to these operations

1. reserve the stack space (1028 bytes on 32-bit ARM) and leave it uninitialized
2. zero the `len` field
3. memcpy `len` bytes of data from the parent

analyzed source code

``` rust
use heapless::Vec;

fn clone(vec: &Vec<u8, 1024>) {
    let mut vec = vec.clone();
    black_box(&mut vec);
}

fn black_box<T>(val: &mut T) {
    unsafe { asm!("// {0}", in(reg) val) }
}
```

machine code with `lto = fat`, `codegen-units = 1` and `opt-level = 'z'` ('z' instead of 3 to avoid loop unrolling and keep the machine code readable)

``` armasm
00020100 <clone>:
   20100:              b5d0             push    {r4, r6, r7, lr}
   20102:              af02             add     r7, sp, #8
   20104:              f5ad 6d81        sub.w   sp, sp, #1032   ; 0x408
   20108:              2300             movs    r3, #0
   2010a:              c802             ldmia   r0!, {r1}
   2010c:              9301             str     r3, [sp, #4]
   2010e:              aa01             add     r2, sp, #4
   20110:       /--/-X b141             cbz     r1, 20124 <clone+0x24>
   20112:       |  |   4413             add     r3, r2
   20114:       |  |   f810 4b01        ldrb.w  r4, [r0], #1
   20118:       |  |   3901             subs    r1, #1
   2011a:       |  |   711c             strb    r4, [r3, #4]
   2011c:       |  |   9b01             ldr     r3, [sp, #4]
   2011e:       |  |   3301             adds    r3, #1
   20120:       |  |   9301             str     r3, [sp, #4]
   20122:       |  \-- e7f5             b.n     20110 <clone+0x10>
   20124:       \----> a801             add     r0, sp, #4
   20126:              f50d 6d81        add.w   sp, sp, #1032   ; 0x408
   2012a:              bdd0             pop     {r4, r6, r7, pc}
```

note that it's not optimizing step (3) to an actual `memcpy` because we lack the 'trait specialization' code that libstd uses

---

before `clone` was optimized to

1. reserve and zero (`memclr`) 1028 (!?) bytes of stack space
2. (unnecessarily) runtime check if `len` is equal or less than 1024 (capacity) -- this included a panicking branch
3. memcpy `len` bytes of data from the parent

Co-authored-by: Jorge Aparicio <jorge.aparicio@ferrous-systems.com>
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

3 participants