
memcpy and co. are really unoptimized #339

Open
CryZe opened this issue Feb 7, 2020 · 14 comments
@CryZe (Contributor) commented Feb 7, 2020

I've seen the assembly of the memcpy here for both PowerPC and wasm (where the one here seems to be used unless you target WASI), and in both cases really simplistic byte-wise loops are emitted. I haven't done any benchmarks, but these likely perform worse than loops that work on 32 bits or so at a time.
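
For illustration, a word-at-a-time loop might look like the following in pure Rust. This is a hedged sketch, not the crate's actual code, and the function name is made up; a tuned version would also align the destination and unroll the inner loop:

```rust
/// Illustrative word-wise copy: usize-sized chunks, then a byte-wise tail.
/// Assumes `dest` and `src` do not overlap.
pub unsafe fn memcpy_wordwise(dest: *mut u8, src: *const u8, n: usize) -> *mut u8 {
    const WORD: usize = core::mem::size_of::<usize>();
    let mut i = 0;
    // Copy one machine word per iteration while enough bytes remain.
    while i + WORD <= n {
        // read_unaligned/write_unaligned keep this sound even when the
        // pointers are not word-aligned.
        let w = src.add(i).cast::<usize>().read_unaligned();
        dest.add(i).cast::<usize>().write_unaligned(w);
        i += WORD;
    }
    // Byte-wise tail for the remaining n % WORD bytes.
    while i < n {
        *dest.add(i) = *src.add(i);
        i += 1;
    }
    dest
}
```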

@zhaofengli

Redox (MIT-licensed) provides better pure-Rust implementations of those functions at https://gitlab.redox-os.org/redox-os/kernel/-/blob/master/src/externs.rs

@josephlr (Contributor) commented May 18, 2020

> Redox (MIT-licensed) provides better pure-Rust implementations of those functions at https://gitlab.redox-os.org/redox-os/kernel/-/blob/master/src/externs.rs

I would caution against using Redox's (current) memcpy implementation as it relies on undefined behavior (doing an unaligned load on a usize). See: https://gitlab.redox-os.org/redox-os/kernel/issues/99. This is UB on all platforms, even those that allow for unaligned loads, and will immediately fault on platforms that ban unaligned loads (like ARM).
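
To make the distinction concrete: the sound way to do a word-sized load from a possibly misaligned pointer is `ptr::read_unaligned`, while dereferencing a misaligned `*const usize` is UB even on architectures whose hardware tolerates it. A minimal sketch (illustrative function, not from either codebase):

```rust
// Loads a usize from `p` regardless of alignment.
fn load_word(p: *const u8) -> usize {
    // BAD (UB whenever `p` is not usize-aligned):
    //     unsafe { *(p as *const usize) }
    // OK on every platform; the compiler emits byte loads where needed:
    unsafe { p.cast::<usize>().read_unaligned() }
}
```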

If you need an optimized implementation, you should probably just link against a libc that provides them. Example of how to do this for musl on bare-metal x86_64: rust-osdev/cargo-xbuild#77 (comment)

@pca006132 (Contributor)

May I ask whether there is a way to specify another memcpy implementation? I'm on an ARM platform, and the default memcpy is not very efficient for my use case, as I have many [u8] slices that have to be copied around.

I've tried modifying the target triple to not include the -none suffix to circumvent issue #369, and managed to link against the ARM GCC Embedded Toolchain for the required memcpy and related symbols. However, the actual output is still linked against the compiler-builtins implementation: the code that copies slices still calls the __aeabi_* functions, which use the implementation defined in mem.rs.

```rust
#[cfg(not(target_os = "ios"))]
#[cfg_attr(not(feature = "mangled-names"), no_mangle)]
#[cfg_attr(thumb, linkage = "weak")]
pub unsafe extern "aapcs" fn __aeabi_memcpy(dest: *mut u8, src: *const u8, n: usize) {
    ::mem::memcpy(dest, src, n);
}
```

Related: #253

@josephlr (Contributor) commented Oct 2, 2020

@pca006132 I think that, similar to #365, we could also add optimized variants of memcpy and friends for aarch64. I'm not sure what the best implementation would be for that architecture. Alternatively, the code in arm.rs could link to the extern symbol memcpy instead of ::mem::memcpy, but improving the implementation in this crate is probably the best bet.

EDIT: It would seem that LLVM on AArch64 emits calls to __aeabi_memcpy, so there's not much we can do about that.

@pca006132 (Contributor)

> @pca006132 I think that, similar to #365, we could also add optimized variants of memcpy and friends for aarch64. I'm not sure what the best implementation would be for that architecture. Alternatively, the code in arm.rs could link to the extern symbol memcpy instead of ::mem::memcpy, but improving the implementation in this crate is probably the best bet.
>
> EDIT: It would seem that LLVM on AArch64 emits calls to __aeabi_memcpy, so there's not much we can do about that.

It is good to have optimized variants for more targets. However, I think embedded systems may have different requirements than normal applications, and the best approach would be to allow them to provide their own implementation if they have to. For example, on armv7a processors we could use a NEON-optimized memcpy once the FPU is enabled, but bare-metal programs may run before the FPU is enabled, so they probably want a memcpy implementation that does not use NEON.

I think the easiest way is to use weak linkage, so users can replace it with their own implementation if they want, such as using the implementation from newlib. #378

@Demindiro (Contributor)

I've experimented a bit with faster mem* functions on a Ryzen 2700X and a i5-5287U and noted the following:

  • For small copies/sets it is significantly faster to do two overlapping unaligned loads and stores, like so:
```rust
#[inline]
unsafe fn small_copy(dest: *mut u8, src: *const u8, count: usize) {
    if count == 0 {
        return;
    }
    if count < 2 {
        *dest = *src;
        return;
    }
    if count <= 4 {
        let a = src.cast::<u16>().read_unaligned();
        let b = src.add(count - 2).cast::<u16>().read_unaligned();
        dest.cast::<u16>().write_unaligned(a);
        dest.add(count - 2).cast::<u16>().write_unaligned(b);
        return;
    }
    if count <= 8 {
        let a = src.cast::<u32>().read_unaligned();
        let b = src.add(count - 4).cast::<u32>().read_unaligned();
        dest.cast::<u32>().write_unaligned(a);
        dest.add(count - 4).cast::<u32>().write_unaligned(b);
        return;
    }
    if count <= 16 {
        let a = src.cast::<u64>().read_unaligned();
        let b = src.add(count - 8).cast::<u64>().read_unaligned();
        dest.cast::<u64>().write_unaligned(a);
        dest.add(count - 8).cast::<u64>().write_unaligned(b);
        return;
    }
    if count <= 32 {
        let a = src.cast::<u128>().read_unaligned();
        let b = src.add(count - 16).cast::<u128>().read_unaligned();
        dest.cast::<u128>().write_unaligned(a);
        dest.add(count - 16).cast::<u128>().write_unaligned(b);
        return;
    }
    // counts > 32 fall through to the large-copy path
}
```
  • For large copies it is faster to use non-temporal stores. This is probably correlated with L2 cache size. This does not seem to apply to rep stos.
  • It's probably worth checking & caching CPUID for ERMS, FSRM et al., and using rep movsb only when applicable.

My main concern is that these optimizations can significantly bloat the mem* functions, so while performance may be better in benchmarks, it may be worse in practice, since the larger functions consume much more i-cache.
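
As a concrete sketch of the rep movsb path mentioned in the bullets above (hypothetical, x86_64-only; whether it actually wins depends on the ERMS/FSRM CPUID bits):

```rust
use core::arch::asm;

/// Copy `count` bytes from `src` to `dest` with `rep movsb`.
/// Relies on the ABI guarantee that the direction flag is clear.
#[cfg(target_arch = "x86_64")]
unsafe fn rep_movsb(dest: *mut u8, src: *const u8, count: usize) {
    asm!(
        "rep movsb",
        inout("rdi") dest => _,   // destination pointer, advanced by the copy
        inout("rsi") src => _,    // source pointer, advanced by the copy
        inout("rcx") count => _,  // byte count, decremented to zero
        options(nostack, preserves_flags)
    );
}
```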

@pca006132 (Contributor)

iirc LLVM already emits small copy operations as a series of loads/stores instead of calling memcpy. Not sure what the threshold is, though.

@Demindiro (Contributor)

> iirc LLVM already emits small copy operations as a series of loads/stores instead of calling memcpy. Not sure what the threshold is, though.

AFAICT it only does so if it knows the size beforehand: https://rust.godbolt.org/z/KK7xj6Goh

@pca006132 (Contributor)

I think using the implementation from glibc is probably a better choice here; those routines are well optimized and already use things like vectorization.

@bjorn3 (Member) commented Jul 7, 2022

> For large copies it is faster to use non-temporal stores.

Non-temporal stores bypass the cache, so it is expected that they are faster. Unfortunately, this also means that unless you take care to ensure the target of the write wasn't in the cache in the first place, you can get inconsistent results: future reads may hit the cache again, and I think if the cacheline is marked as modified, flushing it would overwrite the non-temporal store.

> I think using the implementation from glibc is probably a better choice here; those routines are well optimized and already use things like vectorization.

  1. Glibc is LGPL-licensed.
  2. These memory operation implementations in compiler-builtins are only used on embedded systems, where size is important, and on wasm, where you won't be able to vectorize much anyway. (The simd feature only gives 128-bit vectors, and in most cases Rust wasm programs are compiled without SIMD support.)

@josephlr (Contributor)

We could also add an optimized version using RISC-V's Vector Extension (aka the V Extension). This idea was explored in this Reddit post and this writeup. It seems fairly fast and simple.

If we do:

```rust
asm!(
    "vsetvli  {n}, {count}, e8, m8, ta, ma",
    "vle8.v   v0, ({src})",
    "vse8.v   v0, ({dest})",
    ...
);
```

in a loop, we get assembly (Godbolt link) that is nearly identical to the handwritten example of a memcpy implementation in the spec.

I'm not super familiar with the V extension, but I imagine we could have similar implementations for the other memory functions.
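
For reference, the spec's handwritten memcpy example that the generated code approaches looks roughly like this (a byte-granularity sketch with m8 grouping; check the V extension spec for the authoritative version):

```asm
    # void *memcpy(void *dest, const void *src, size_t n)
    # a0 = dest, a1 = src, a2 = n
memcpy:
    mv      a3, a0                     # preserve dest for the return value
loop:
    vsetvli t0, a2, e8, m8, ta, ma     # t0 = bytes handled this iteration
    vle8.v  v0, (a1)                   # load up to t0 bytes from src
    add     a1, a1, t0
    sub     a2, a2, t0
    vse8.v  v0, (a3)                   # store them to dest
    add     a3, a3, t0
    bnez    a2, loop                   # repeat until n bytes are copied
    ret
```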

@bjorn3 (Member) commented Jul 21, 2022

It would be nice if the loop itself could be inside the asm block. Codegen backends which don't have native inline asm support (like my cg_clif) turn asm blocks into calls to a function defined in assembly compiled using an external assembler. Doing such calls in a loop has a decent amount of overhead compared to doing it once and having the loop inside the asm block.

@Lokathor (Contributor)

I've recently written some ARM memcpy implementations. Hopefully they can be PR'd into this repo at some point. The only catch is that they're handwritten a32 code, and some of the newer ARM targets don't support a32 code at all, so they can't simply be dropped in for all ARM targets.

If anyone wants to get started, the code is over here (all Apache-2.0 OR MIT, as is normal for Rust): https://github.com/Lokathor/arm7tdmi_aeabi/blob/b0861b2ec026eaad0142d75b138a2e02c6f38cb9/src/the_code.s

sethp added a commit to rustbox/esp-hal that referenced this issue May 14, 2023
Addresses two classes of icache thrash present in the interrupt service
path, e.g.:

```asm
            let mut prios = [0u128; 16];
40380d44:       ec840513                addi    a0,s0,-312
40380d48:       10000613                li      a2,256
40380d4c:       ec840b93                addi    s7,s0,-312
40380d50:       4581                    li      a1,0
40380d52:       01c85097                auipc   ra,0x1c85
40380d56:       11e080e7                jalr    286(ra) # 42005e70 <memset>
```

and

```asm
            prios
40380f9c:       dc840513                addi    a0,s0,-568
40380fa0:       ec840593                addi    a1,s0,-312
40380fa4:       10000613                li      a2,256
40380fa8:       dc840493                addi    s1,s0,-568
40380fac:       01c85097                auipc   ra,0x1c85
40380fb0:       eae080e7                jalr    -338(ra) # 42005e5a <memcpy>
```

As an added bonus, performance of the whole program improves
dramatically with these routines 1) reimplemented for the esp32 RISC-V
µarch and 2) in SRAM: `rustc` is quite happy to emit lots of implicit
calls to these functions, and the versions that ship with
compiler-builtins are [highly tuned] for other platforms. It seems like
the expectation is that the compiler-builtins versions are "reasonable
defaults," and they are [weakly linked] specifically to allow the kind
of domain-specific overrides as here.

In the context of the 'c3, this ends up producing a fairly large
implementation that adds a lot of frequent cache pressure for minimal
wins:

```readelf
   Num:    Value  Size Type    Bind   Vis      Ndx Name
 27071: 42005f72    22 FUNC    LOCAL  HIDDEN     3 memcpy
 27072: 42005f88    22 FUNC    LOCAL  HIDDEN     3 memset
 28853: 42005f9e   186 FUNC    LOCAL  HIDDEN     3 compiler_builtins::mem::memcpy
 28854: 42006058   110 FUNC    LOCAL  HIDDEN     3 compiler_builtins::mem::memset
```

NB: these implementations are broken when targeting unaligned
loads/stores across the instruction bus; at least in my testing this
hasn't been a problem, because they are simply never invoked in that
context.

Additionally, these are just about the simplest possible
implementations, with word-sized copies being the only concession made
to runtime performance. Even a small amount of additional effort would
probably yield fairly massive wins, as three- or four-instruction hot
loops like these are basically pathological for the 'c3's pipeline
implementation that seems to predict all branches as "never taken."

However: there is a real danger in overtraining on the microbenchmarks here, too,
as I would expect almost no one has a program whose runtime is dominated
by these functions. Making these functions larger and more complex to
eke out wins from architectural niches makes LLVM much less willing to
inline them, costing additional function calls and preventing e.g. dead code
elimination for always-aligned addresses or automatic loop unrolling,
etc.

[highly tuned]: rust-lang/compiler-builtins#405
[weakly linked]: rust-lang/compiler-builtins#339 (comment)
@ChrisDenton (Contributor)

For very small memcpys, could we not just use a couple of registers, like LLVM does? That shouldn't affect binary size much, no?
