Examples of bad Rust SIMD perf? #135

Open
workingjubilee opened this issue Jun 9, 2021 · 6 comments
Labels
C-feature-request Category: a feature request, i.e. not implemented / a PR

Comments

@workingjubilee
Member

One thing we could use to help check the library against is examples of bad Rust SIMD performance, particularly anything that is an actual regression, especially relative to expectations. It may help motivate a solution to rust-lang/rust#64609 if we can find examples of bad or divergent SIMD performance for Rust versus C (clang) on a given architecture for equivalent code. I had a conversation with compiler devs who are more familiar with the inner workings of LLVM and the compiler's SIMD machinery, and they expect LLVM to see through and properly handle the "pass through memory" trick as long as things are inlined. So we're looking for examples where LLVM mysteriously fails, or where enough ops are done that LLVM decides inlining them all isn't practical.

This obviously doesn't cover cases where we just completely scalarize things, so we're ignoring #76 for the purposes of this issue, and examples don't actually have to involve our core::simd implementation. Rather, it's an overall concern: if we can collect examples to compare against, they would help us bench, profile, and test possible solutions.

I'm also not actually limiting this to just Rust vs. C; clang just happens to be there and is also LLVM-driven. Anything where our SIMD takes a beating vs. ${LANG} is a good example, and cases where we're merely on par are worth noting too.
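For readers who haven't seen it, here is a loose, hypothetical illustration of the "pass through memory" trick (this is not the actual stdarch or rustc mechanism, just a sketch of the shape of the problem): a vector value round-trips through a stack buffer, as it would when passed by memory across an ABI boundary, and once the call is inlined LLVM is expected to delete the store/load pair.

```rust
// Hypothetical sketch only: a SIMD value "passed through memory".
// When `add` is inlined at its call site, LLVM should be able to elide
// `buf` entirely and keep the value in registers.
#[cfg(target_arch = "x86_64")]
mod demo {
    use std::arch::x86_64::{__m128, _mm_add_ps, _mm_loadu_ps, _mm_storeu_ps};

    #[inline]
    pub unsafe fn add(a: __m128, b: __m128) -> __m128 {
        // Spill the result to a stack buffer and reload it, mimicking a
        // by-memory call boundary.
        let mut buf = [0.0f32; 4];
        _mm_storeu_ps(buf.as_mut_ptr(), _mm_add_ps(a, b));
        _mm_loadu_ps(buf.as_ptr())
    }
}
```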

workingjubilee added the C-feature-request label on Jun 9, 2021
@thomcc
Member

thomcc commented Jul 25, 2021

> vs C (clang) on a given architecture for equivalent code

What's equivalent? I think this might be tricky. I remember filing rust-lang/stdarch#1155 because the generated code in a hot loop was about 2x slower than expected. This was because it implemented the behavior of the LLVM intrinsics, which took around 3x as many instructions as the native intrinsics.

This is disappointing, since I suspect it means that in practice "portable simd" will always have a cost, and you'll be better off using the architecture-specific intrinsics if you can afford to write them and know you don't hit the problem cases. (My hope is that some -ffast-math-style support can bridge the gap here, but I suspect it will be quite difficult to push all the way through.)

This is different from the inlining failures you're asking about, though. I expect those to happen at the -Oz or -Os optimization levels in some cases, which is unfortunate and kind of tricky to address even if we do find them.

@andy-thomason

It would be nice if "target" defaulted to "native", as the current default for x86_64 targets a rather ancient architecture.

The best way to make code portable, it seems, is to use conditional compilation for avx2 and other features.

I was thinking of a "go_faster!" macro that could wrap high-level code and use the best features available, such as avx2, avx512, and sve, using a combination of conditional compilation and runtime switching.

We could also wrap some of the more terrible LLVM SIMD multi-instruction generics in conditional compilation wrappers. For example, round and min/max are currently multi-instruction sequences.
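For reference, a rough sketch of the conditional-compilation-plus-runtime-switching idea using only what's in std today; the function names (`sum`, `sum_avx2`) are purely illustrative, not a proposed API:

```rust
// Compile an AVX2-enabled copy of the kernel and dispatch to it at runtime.
#[cfg(target_arch = "x86_64")]
pub fn sum(xs: &[f32]) -> f32 {
    // Same body, but compiled with AVX2 enabled so LLVM may use 256-bit
    // vectors when it vectorizes the loop.
    #[target_feature(enable = "avx2")]
    unsafe fn sum_avx2(xs: &[f32]) -> f32 {
        xs.iter().sum()
    }

    if is_x86_feature_detected!("avx2") {
        // SAFETY: AVX2 support was just verified at runtime.
        unsafe { sum_avx2(xs) }
    } else {
        xs.iter().sum()
    }
}
```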

@calebzulawski
Member

> It would be nice if "target" defaulted to "native", as the current default for x86_64 targets a rather ancient architecture.

This change would be much broader than SIMD, of course, but I think it is unlikely to ever happen: by default, it is expected that your code will run on other machines with the same architecture. This problem only really affects x86-64, which is why clang etc. have already added x86-64 feature levels (such as x86-64-v3); targeting one of those is probably the best way to handle it, similar to how this is already handled for ARMv7.

> The best way to make code portable, it seems, is to use conditional compilation for avx2 and other features.

> I was thinking of a "go_faster!" macro that could wrap high-level code and use the best features available, such as avx2, avx512, and sve, using a combination of conditional compilation and runtime switching.

You may be interested in my multiversion crate.

> We could also wrap some of the more terrible LLVM SIMD multi-instruction generics in conditional compilation wrappers. For example, round and min/max are currently multi-instruction sequences.

All of the intrinsics we use right now generate code for the target feature level in the user's crate (they are all inline functions). If anything is resulting in suboptimal codegen, it's either a limitation of your target features or a bug in LLVM.

@workingjubilee
Member Author

workingjubilee commented Oct 5, 2021

These are some examples I found in old Rust issues that seem to qualify under this problem. Interestingly, it seems that C++ may be a bigger rival than C here.

@workingjubilee
Member Author

IIRC some of the SIMD dialects, and certainly LLVM, allow immediates to describe some vector patterns, so we should check whether we actually emit that asm when the value is in fact const-known.
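As one concrete, hypothetical thing to check (assuming nightly's `portable_simd` feature): a compile-time-known all-ones vector like the one below can in principle be materialized without touching memory (e.g. via `pcmpeqd` on x86) rather than loaded from a constant pool; whether we actually get that asm is the question.

```rust
#![feature(portable_simd)]
use std::simd::Simd;

// A const-known vector pattern: every lane is -1, i.e. all bits set.
pub fn all_ones() -> Simd<i32, 4> {
    Simd::splat(-1)
}
```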

@tavianator

> We could also wrap some of the more terrible LLVM SIMD multi-instruction generics in conditional compilation wrappers. For example, round and min/max are currently multi-instruction sequences.

> All of the intrinsics we use right now generate code for the target feature level in the user's crate (they are all inline functions). If anything is resulting in suboptimal codegen, it's either a limitation of your target features or a bug in LLVM.

simd_min()/simd_max() generate something like this on x86:

        vminps  ymm2, ymm1, ymm0
        vcmpunordps     ymm0, ymm0, ymm0
        vblendvps       ymm0, ymm2, ymm1, ymm0

in order to have the right semantics if an argument is NaN. If you just want minps (because you don't care about NaN or you actually want those semantics), x.simd_lt(y).select(x, y) generates

        vminps  ymm2, ymm1, ymm0
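
To make the two forms concrete, a small sketch against nightly's `std::simd` (trait paths have moved between nightlies, so the prelude import is an assumption):

```rust
#![feature(portable_simd)]
use std::simd::{f32x8, prelude::*};

// NaN-aware minimum per portable-simd's semantics: on x86 this lowers to the
// vminps + vcmpunordps + vblendvps sequence shown above.
pub fn min_ieee(x: f32x8, y: f32x8) -> f32x8 {
    x.simd_min(y)
}

// Compare-and-select minimum, fine if you don't care about NaN (or want
// minps's semantics): this can lower to a single vminps.
pub fn min_fast(x: f32x8, y: f32x8) -> f32x8 {
    x.simd_lt(y).select(x, y)
}
```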
