Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Use more simd_* intrinsics #790

Merged
merged 17 commits into from Dec 18, 2019
Merged

Conversation

@bjorn3
Copy link
Contributor

bjorn3 commented Jul 31, 2019

I currently only did this for x86. Also I skipped _mm_sqrt_ps and some more, as llvm emitted rsqrtps combined with a lot of extra instructions instead of sqrtps, causing slight rounding errors and non optimal codegen.

cc #788

@gnzlbg

This comment has been minimized.

Copy link
Collaborator

gnzlbg commented Jul 31, 2019

Also I skipped _mm_sqrt_ps and some more, as llvm emitted rsqrtps combined with a lot of extra instructions instead of sqrtps, causing slight rounding errors and non optimal codegen.

I'll look into that.

@bjorn3

This comment has been minimized.

Copy link
Contributor Author

bjorn3 commented Jul 31, 2019

Got the same for _mm256_sqrt_ps. The pd versions were working correcly in both cases.

@bjorn3

This comment has been minimized.

Copy link
Contributor Author

bjorn3 commented Jul 31, 2019

#![feature(platform_intrinsics)]

extern crate core;

use core::arch::x86_64::__m128;

extern "platform-intrinsic" {
    fn simd_fsqrt<T>(a: T) -> T;
}

pub unsafe fn sqrt(a: __m128) -> __m128 {
    simd_fsqrt(a)
}

Optimized LLVM:

; playground::sqrt
; Function Attrs: nofree nounwind nonlazybind uwtable
define void @_ZN10playground4sqrt17h5d635885a5180697E(<4 x float>* noalias nocapture sret dereferenceable(16), <4 x float>* noalias nocapture readonly dereferenceable(16) %a) unnamed_addr #0 {
start:
  %1 = load <4 x float>, <4 x float>* %a, align 16
  %2 = tail call fast <4 x float> @llvm.sqrt.v4f32(<4 x float> %1)
  store <4 x float> %2, <4 x float>* %0, align 16
  ret void
}

Optimized asm:

.LCPI0_0:
	.long	3204448256              # float -0.5
	.long	3204448256              # float -0.5
	.long	3204448256              # float -0.5
	.long	3204448256              # float -0.5

.LCPI0_1:
	.long	3225419776              # float -3
	.long	3225419776              # float -3
	.long	3225419776              # float -3
	.long	3225419776              # float -3

playground::sqrt: # @playground::sqrt
# %bb.0:
	movq	%rdi, %rax
	movaps	(%rsi), %xmm0
	rsqrtps	%xmm0, %xmm1
	movaps	%xmm0, %xmm2
	mulps	%xmm1, %xmm2
	movaps	.LCPI0_0(%rip), %xmm3   # xmm3 = [-5.0E-1,-5.0E-1,-5.0E-1,-5.0E-1]
	mulps	%xmm2, %xmm3
	mulps	%xmm1, %xmm2
	addps	.LCPI0_1(%rip), %xmm2
	xorps	%xmm1, %xmm1
	cmpneqps	%xmm0, %xmm1
	mulps	%xmm3, %xmm2
	andps	%xmm2, %xmm1
	movaps	%xmm1, (%rdi)
	retq
@bjorn3

This comment has been minimized.

Copy link
Contributor Author

bjorn3 commented Jul 31, 2019

I have gone through every llvm intrinsic for x86 and x86_64 to see if there is a simd_* replacement.

Copy link
Collaborator

gnzlbg left a comment

I've left some questions.

crates/core_arch/src/x86/fma.rs Show resolved Hide resolved
crates/core_arch/src/x86/avx.rs Show resolved Hide resolved
crates/core_arch/src/x86/sse.rs Outdated Show resolved Hide resolved
@gnzlbg gnzlbg closed this Aug 2, 2019
@gnzlbg gnzlbg reopened this Aug 2, 2019
@@ -255,7 +255,7 @@ pub unsafe fn _mm256_andnot_ps(a: __m256, b: __m256) -> __m256 {
#[cfg_attr(test, assert_instr(vmaxpd))]
#[stable(feature = "simd_x86", since = "1.27.0")]
pub unsafe fn _mm256_max_pd(a: __m256d, b: __m256d) -> __m256d {
maxpd256(a, b)
simd_fmax(a, b)

This comment has been minimized.

Copy link
@gnzlbg

gnzlbg Aug 5, 2019

Collaborator

Is the behavior of these the same, e.g., for subnormals, when one argument contain NaNs, etc. ?

@@ -219,6 +220,7 @@ pub unsafe fn _mm_max_ss(a: __m128, b: __m128) -> __m128 {
#[cfg_attr(test, assert_instr(maxps))]
#[stable(feature = "simd_x86", since = "1.27.0")]
pub unsafe fn _mm_max_ps(a: __m128, b: __m128) -> __m128 {
// See the `test_mm_min_ps` test why this can't be implemented using `simd_fmax`.

This comment has been minimized.

Copy link
@gnzlbg

gnzlbg Aug 5, 2019

Collaborator

I think it would be better to add similar tests to the other intrinsics using simd_fmax and simd_fmin, that check subnormals, and also that check the behavior when the first argument is nan, and the second non-nan, and viceversa.

This comment has been minimized.

Copy link
@bjorn3

bjorn3 Aug 5, 2019

Author Contributor

How do I create a subnormal? As far as I understand they are close to zero, but I don't know how close.

This comment has been minimized.

Copy link
@gnzlbg

gnzlbg Aug 5, 2019

Collaborator

How do I create a subnormal?

Check out the docs for f{32,64}::is_normal(). Each floating-point type has a MIN_POSITIVE number, and all numbers between that one and zero (I think in range: (-MIN_POSITIVE, MIN_POSITIVE)) are subnormal. I don't know if creating them from a literal returns 0.0 or not. But if they do, then checking permutations of -0.0, 0.0, and NaN should be enough, e.g., (-0.0, 0.0), (0.0, -0.0), (1.0, NaN), (NaN, 1.0).

This comment has been minimized.

Copy link
@bjorn3

bjorn3 Aug 5, 2019

Author Contributor

0.000000000000000000000000000000000000000000001f32.is_normal() returns false and transmuting it to [u8; 4] gives [1, 0, 0, 0]. Do you want to check permutations with that number too? Or should I just use 0.0?

This comment has been minimized.

Copy link
@gnzlbg

gnzlbg Aug 5, 2019

Collaborator

Do you want to check permutations with that number too?

Yes, we should check that too :)

let b: [u8; 16] = transmute(b);
assert_eq!(r1, b);
assert_eq!(r2, a);
assert_ne!(a, b); // sanity check that -0.0 is actually present

This comment has been minimized.

Copy link
@gnzlbg

gnzlbg Aug 5, 2019

Collaborator

I think we need to also test here the behavior when the first argument is nan and the second is not, and vice versa (e.g. if the result the Nan? the second argument ? always the non-nan ? etc.).

crates/stdarch-test/src/lib.rs Show resolved Hide resolved
@gnzlbg gnzlbg closed this Aug 18, 2019
@gnzlbg gnzlbg reopened this Aug 18, 2019
@bjorn3

This comment has been minimized.

Copy link
Contributor Author

bjorn3 commented Sep 1, 2019

LLVM doesn't use the simd instructions for certain intrinsics on i586.

@gnzlbg

This comment has been minimized.

Copy link
Collaborator

gnzlbg commented Sep 6, 2019

@bjorn3 maybe we could use the generic intrinsics in some cases (e.g. #[cfg(target_feature = "sse2")] ?), and the specific ones in others ?

@bjorn3 bjorn3 force-pushed the bjorn3:use_more_simd_x_intrinsics branch from 9019582 to 03a312c Nov 26, 2019
@bjorn3

This comment has been minimized.

Copy link
Contributor Author

bjorn3 commented Nov 26, 2019

Rebased to trigger CI, as the old logs are no longer available.

@bjorn3

This comment has been minimized.

Copy link
Contributor Author

bjorn3 commented Nov 26, 2019

Windows build failed while installing rust:

Run rustup update nightly --no-self-update && rustup default nightly
At D:\a\_temp\0855049a-8a0a-4cb8-bf51-de53a4f07b31.ps1:2 char:40
+ rustup update nightly --no-self-update && rustup default nightly
+                                        ~~
The token '&&' is not a valid statement separator in this version.
+ CategoryInfo          : ParserError: (:) [], ParseException
+ FullyQualifiedErrorId : InvalidEndOfLine
@gnzlbg

This comment has been minimized.

Copy link
Collaborator

gnzlbg commented Nov 27, 2019

rustup update nightly --no-self-update && rustup default nightly

Can you split this statement into two different lines and try again?

rustup update nightly --no-self-update
rustup default nightly
@makotokato

This comment has been minimized.

Copy link
Contributor

makotokato commented Dec 17, 2019

Windows build failed while installing rust:

This is fixed by ac59837

@bjorn3 bjorn3 force-pushed the bjorn3:use_more_simd_x_intrinsics branch from 03a312c to 4c7d4b5 Dec 17, 2019
@gnzlbg gnzlbg closed this Dec 17, 2019
@gnzlbg gnzlbg reopened this Dec 17, 2019
@gnzlbg

This comment has been minimized.

Copy link
Collaborator

gnzlbg commented Dec 17, 2019

Closing / reopening to re-trigger CI.

On i586 the simd_* intrinsics don't compile to MMX instructions, even
with `#[target_feature(enable = "mmx")]`.
@bjorn3

This comment has been minimized.

Copy link
Contributor Author

bjorn3 commented Dec 17, 2019

Reverted the mmx changes, as those are the ones not compiling to the required instruction.

@bjorn3

This comment has been minimized.

Copy link
Contributor Author

bjorn3 commented Dec 17, 2019

CI is finally happy!

@gnzlbg

This comment has been minimized.

Copy link
Collaborator

gnzlbg commented Dec 18, 2019

Reverted the mmx changes, as those are the ones not compiling to the required instruction.

Uh, sorry, my fault, I should have caught this. Yes, mmx intrinsics (or those using the _m64 type in general) won't work with the generic simd_ intrinsics. I wouldn't worry about that, _m64 creates so many headaches that few people are using it, and also, chances are we will never stabilize it.

@gnzlbg gnzlbg merged commit b51ba3f into rust-lang:master Dec 18, 2019
29 checks passed
29 checks passed
Check Style
Details
Build Documentation
Details
Automatic intrinsic verification
Details
Env Override
Details
Test (i686-unknown-linux-gnu)
Details
Test (x86_64-unknown-linux-gnu)
Details
Test (x86_64-unknown-linux-gnu-emulated)
Details
Test (arm-unknown-linux-gnueabihf)
Details
Test (armv7-unknown-linux-gnueabihf)
Details
Test (aarch64-unknown-linux-gnu)
Details
Test (powerpc64le-unknown-linux-gnu)
Details
Test (mips-unknown-linux-gnu)
Details
Test (mips64-unknown-linux-gnuabi64)
Details
Test (mips64el-unknown-linux-gnuabi64)
Details
Test (s390x-unknown-linux-gnu)
Details
Test (wasm32-unknown-unknown)
Details
Test (i586-unknown-linux-gnu)
Details
Test (x86_64-linux-android)
Details
Test (arm-linux-androideabi)
Details
Test (mipsel-unknown-linux-musl)
Details
Test (aarch64-linux-android)
Details
Test (nvptx64-nvidia-cuda)
Details
Test (thumbv6m-none-eabi)
Details
Test (thumbv7m-none-eabi)
Details
Test (thumbv7em-none-eabi)
Details
Test (thumbv7em-none-eabihf)
Details
Test (x86_64-apple-darwin)
Details
Test (x86_64-pc-windows-msvc)
Details
x86_64-unknown-freebsd Task Summary
Details
@gnzlbg

This comment has been minimized.

Copy link
Collaborator

gnzlbg commented Dec 18, 2019

Thank you @bjorn3 for working on this!

@bjorn3 bjorn3 deleted the bjorn3:use_more_simd_x_intrinsics branch Dec 18, 2019
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
3 participants
You can’t perform that action at this time.