min, max, minNum, maxNum #67

Closed · sunfishcode opened this issue Sep 22, 2014 · 6 comments

sunfishcode commented Sep 22, 2014

There are common idioms built around min and max that depend on the handling of NaN, and on other platforms there are major applications that rely on those idioms. If we leave the behavior on NaN undefined, it's very likely that applications will come to depend on whatever popular implementations happen to do.

For example, HLSL tried to make min/max undefined on NaN. However, the HLSL bytecode ended up having to define the behavior of min/max on NaN.

I propose we change the SIMD min and max functions:

  • change the polyfill for min and max to use Math.min and Math.max
  • require that when either operand is a NaN, the result is a NaN
  • require that 0.0 is considered greater than -0.0 for min and max

Also, I propose we add minNum and maxNum functions, following IEEE-754. These functions return the number if exactly one operand is NaN. IEEE-754 leaves the handling of -0.0 unspecified, but I further propose that 0.0 be treated as greater than -0.0 for these functions as well (though this does add a cost on x86, discussed below).
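
For concreteness, here is a scalar sketch of the proposed semantics (my own illustration, not normative text; max and maxNum mirror these with the comparison reversed and +0.0 preferred on ties):

#include <math.h>

float min_proposed(float a, float b) {
    if (isnan(a) || isnan(b)) return NAN;   /* NaN if either operand is NaN */
    if (a == 0.0f && b == 0.0f)             /* -0.0 is treated as less than +0.0 */
        return signbit(a) ? a : b;
    return a < b ? a : b;
}

float minNum_proposed(float a, float b) {
    if (isnan(a)) return b;                 /* return the number if one operand is NaN */
    if (isnan(b)) return a;
    return min_proposed(a, b);
}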

On NEON, each of these functions maps to a single instruction: VMIN, VMAX, VMINNM, and VMAXNM (with the possible exception of the handling of -0.0 by VMINNM and VMAXNM, where the documentation is vague, but I'm tentatively guessing it does what we want).

x86 is a challenge here. I came up with the following possible sequences. Note that this is the AVX version; non-AVX would require more copies due to lack of 3-operand VEX encoding, and pre-SSE4.1 may require and+andn+or instead of blendvps. Note also that actual compilers could reuse registers more aggressively in these sequences.

min:
vminps %xmm0, %xmm1, %xmm2 # minps returns its second source operand when either input is NaN
vminps %xmm1, %xmm0, %xmm3 # ...so compute both operand orders
vorps %xmm2, %xmm3, %xmm4 # NaN or'd with arbitrary bits is NaN; -0.0 beats +0.0 on ties

max:
vcmpunordps %xmm0, %xmm1, %xmm2 # where are the NaNs?
vmaxps %xmm0, %xmm1, %xmm3 # maxps returns its second source operand when either input is NaN
vmaxps %xmm1, %xmm0, %xmm4 # ...so compute both operand orders
vandps %xmm3, %xmm4, %xmm5 # +0.0 beats -0.0 on ties, but NaN lanes may be clobbered...
vorps %xmm5, %xmm2, %xmm6 # ...so or in the all-ones NaNs

minNum:
vcmpordps %xmm0, %xmm0, %xmm2 # where are the numbers in %xmm0?
vblendvps %xmm2, %xmm0, %xmm1, %xmm3 # %xmm3 has %xmm0 without the NaNs
vminps %xmm3, %xmm1, %xmm4 # minNum, except for negative zeros in %xmm1
vcmpeqps %xmm4, %xmm1, %xmm5 # lanes where %xmm1 compares equal to the result (covers the ±0.0 ties)
vorps %xmm4, %xmm1, %xmm6 # or in %xmm1's sign bit so that -0.0 wins the tie
vblendvps %xmm5, %xmm6, %xmm4, %xmm7 # use the sign-corrected value only on those lanes

maxNum:
vcmpordps %xmm0, %xmm0, %xmm2 # where are the numbers in %xmm0?
vblendvps %xmm2, %xmm0, %xmm1, %xmm3 # %xmm3 has %xmm0 without the NaNs
vmaxps %xmm3, %xmm1, %xmm4 # maxNum, except for negative zeros in %xmm1
vcmpeqps %xmm4, %xmm1, %xmm5 # lanes where %xmm1 compares equal to the result (covers the ±0.0 ties)
vandps %xmm4, %xmm1, %xmm6 # and with %xmm1 so that +0.0 wins the tie
vblendvps %xmm5, %xmm6, %xmm4, %xmm7 # use the sign-corrected value only on those lanes
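
For reference, the "min" sequence above renders into <xmmintrin.h> intrinsics along these lines (my own translation; the newmin name is made up, and a corresponding max version appears later in this thread as newmax):

#include <xmmintrin.h>

__m128 newmin(__m128 a, __m128 b) {
    __m128 x = _mm_min_ps(a, b); /* returns the lane from b when either input is NaN */
    __m128 y = _mm_min_ps(b, a); /* returns the lane from a when either input is NaN */
    return _mm_or_ps(x, y);      /* NaN or'd with anything stays NaN; -0.0 | +0.0 is -0.0 */
}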

If anyone can find faster sequences for any of these, I'd be very interested. Perhaps in the future x86 will add symmetric min/max and/or minNum/maxNum instructions, hopefully with -0.0 handling, which would make this much easier :-).

On x86, this proposal contravenes the "everything maps to a single instruction" expectation that some have, but I claim that portable NaN handling for min/max is important enough to justify it. -0.0 handling is less important, but it seems to be free on ARM and Power, and it's behavior that the existing JS Math.min and Math.max already provide anyway.

sunfishcode commented Sep 23, 2014

As discussed in #59, I just added implementations of _mm_max_ps and _mm_min_ps in Emscripten's <xmmintrin.h> following the x86 semantics (since <xmmintrin.h> is an x86-derived API). Consequently, they no longer use SIMD.float32x4.max or SIMD.float32x4.min, so the proposal above doesn't relate to them.

sunfishcode commented Sep 25, 2014

One other tidbit: when either operand of min/max/minNum/maxNum is a constant, the NaN and negative-zero concerns can easily be optimized away, yielding single-instruction sequences on all platforms.
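
For example (my own sketch): with a non-NaN, nonzero constant, placing the constant in the first-source position is enough, because minps/maxps return their second source operand whenever either input is NaN, and a nonzero constant rules out the +0.0/-0.0 tie.

#include <xmmintrin.h>

/* min(x, 1.0) under the proposed semantics in a single minps: the constant is
   known to be neither NaN nor a zero, so none of the fixups above are needed.
   (The function name and the particular constant are just for illustration.) */
static inline __m128 min_with_one(__m128 x) {
    return _mm_min_ps(_mm_set1_ps(1.0f), x); /* minps returns x when x is NaN */
}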

mhaghigh commented Oct 2, 2014

Below is the feedback from our Numerics team regarding the code sequences above.

-moh

I think the min/max sequences shown on the page look good.
For MINNUM/MAXNUM, here are some alternatives that only use one blend. I am using pseudo-code so they can be easily translated to SSE2, SSE4 or AVX. (For SSE2, the blends would be emulated with logical ops.)

MINNUM:
R = MINPS(a,b) ; this is the result, or b if b is NaN
Mask = CMPNEPS(b,b) ; Mask=11..1 if b is NaN, 0 otherwise
Mask2 = PCMPEQ(a, 8000..0) ; Mask2=11..1 if a is -0, 0 otherwise
; note that this is the integer PCMPEQ instruction
Mask2 = AND(Mask2, 800..0) ; set Mask2 to 800..0 if a==-0, 0 otherwise
R = OR(R, Mask2) ; if a==-0, set sign in R (result of MIN is negative, or -0)
R = BLENDV(R, a, Mask) ; set result to a, if b is NaN

MAXNUM:
R = MAXPS(a,b) ; this is the result, or b if b is NaN
Mask = CMPNEPS(b,b) ; Mask=11..1 if b is NaN, 0 otherwise
Mask2 = PCMPEQ(a, 0000..0) ; Mask2=11..1 if a is +0, 0 otherwise
; note that this is the integer PCMPEQ instruction
Mask2 = AND(Mask2, 800..0) ; set Mask2 to 800..0 if a==+0, 0 otherwise
R = ANDNOT(Mask2, R) ; R=(~Mask2) & R, clears the sign of R if a==+0 (result of MAX is strictly positive, or +0)
R = BLENDV(R, a, Mask) ; set result to a, if b is NaN
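
For concreteness, here is one possible SSE4.1 intrinsics rendering of the MINNUM pseudo-code above (my own translation; the function name is mine, an SSE2-only build would emulate _mm_blendv_ps with and/andnot/or as noted, and MAXNUM is analogous with _mm_max_ps, a compare against +0, and an andnot):

#include <smmintrin.h> /* SSE4.1; also pulls in the SSE/SSE2 intrinsics used below */

__m128 minnum_sse41(__m128 a, __m128 b) {
    const __m128i negzero = _mm_set1_epi32((int)0x80000000);
    __m128 r     = _mm_min_ps(a, b);                  /* result, or b if b is NaN */
    __m128 bnan  = _mm_cmpneq_ps(b, b);               /* all-ones where b is NaN */
    __m128 aneg0 = _mm_castsi128_ps(                  /* all-ones where a is -0.0 (integer compare) */
        _mm_cmpeq_epi32(_mm_castps_si128(a), negzero));
    __m128 sign  = _mm_and_ps(aneg0, _mm_castsi128_ps(negzero)); /* 0x80000000 where a == -0.0 */
    r = _mm_or_ps(r, sign);                           /* if a == -0.0, set the sign in r */
    return _mm_blendv_ps(r, a, bnan);                 /* set the result to a where b is NaN */
}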

sunfishcode commented Oct 3, 2014

Thanks Moh, and the Numerics team :-). Those MINNUM and MAXNUM sequences do indeed look better than the ones I posted above, even on machines with blendv.

sunfishcode commented Oct 13, 2014

Here's native <xmmintrin.h> code for the NaN- and negative-zero-adjusted max algorithm above:

__m128 newmax(__m128 a, __m128 b) {
    __m128 n = _mm_cmpunord_ps(a, b); /* all-ones in the NaN lanes */
    __m128 x = _mm_max_ps(a, b);
    __m128 y = _mm_max_ps(b, a);
    __m128 z = _mm_and_ps(x, y);      /* +0.0 beats -0.0 on ties */
    __m128 r = _mm_or_ps(z, n);       /* or in the all-ones NaNs */
    return r;
}

This sequence isn't intended to be a replacement for _mm_max_ps, but it's interesting to benchmark against _mm_max_ps, since that's a kind of theoretical limit. If you have code which uses _mm_max_ps and doesn't care about NaN or negative zero, try replacing the _mm_max_ps calls with calls to this function, benchmark it with your favorite native compiler, and report back here. Whoever finds the most interesting slowdown wins! (Here "interesting" is roughly slowdown times realism.)

I replaced _mm_max_ps with newmax in a few arbitrarily chosen real-world compute kernels and wasn't able to measure any difference in performance. Of course, this isn't necessarily representative of anything. I also constructed the worst artificial benchmark I could come up with in a brief attempt, and that code took almost exactly 3x as long to execute with newmax as with _mm_max_ps.

To be sure, 3x is a lot. This is also an extreme case where I purposely tried to make it look bad. Also keep in mind that this situation is unique to min and max (and maybe clamp), so it doesn't reflect the situation for the rest of SIMD.js.

So, the current time to beat is 3x on an artificial benchmark. Can you make newmax look worse, either with a bigger slowdown, or a more realistic testcase?
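
In case it helps anyone get started, here is a minimal standalone harness along these lines (my own sketch, not from the thread; the data, sizes, and iteration counts are arbitrary placeholders, and the loop-carried dependence on acc makes it deliberately latency-bound):

#include <stdio.h>
#include <time.h>
#include <xmmintrin.h>

__m128 newmax(__m128 a, __m128 b) {
    __m128 n = _mm_cmpunord_ps(a, b);
    __m128 x = _mm_max_ps(a, b);
    __m128 y = _mm_max_ps(b, a);
    __m128 z = _mm_and_ps(x, y);
    return _mm_or_ps(z, n);
}

int main(void) {
    enum { N = 1024 };
    static float data[N];
    for (int i = 0; i < N; i++)
        data[i] = (float)((i * 2654435761u) % 1000u);

    __m128 acc = _mm_set1_ps(-1e30f);
    clock_t t0 = clock();
    for (long rep = 0; rep < 1000000; rep++)
        for (int i = 0; i < N; i += 4)
            acc = newmax(acc, _mm_loadu_ps(&data[i])); /* swap in _mm_max_ps to get the baseline */
    double secs = (double)(clock() - t0) / CLOCKS_PER_SEC;

    float out[4];
    _mm_storeu_ps(out, acc);
    printf("lanes: %g %g %g %g  time: %.3fs\n", out[0], out[1], out[2], out[3], secs);
    return 0;
}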

chadaustin commented Oct 13, 2014

Just FYI: I don't think I'll have time to investigate this for a while. We are in the middle of some critically high-priority projects.

At a high level, in the algorithm:

__m128 newmax(__m128 a, __m128 b) {
    __m128 n = _mm_cmpunord_ps(a, b);
    __m128 x = _mm_max_ps(a, b);
    __m128 y = _mm_max_ps(b, a);
    __m128 z = _mm_and_ps(x, y);
    __m128 r = _mm_or_ps(z, n);
    return r;
}

The cost of the algorithm is going to depend heavily on whether the above instructions can be interleaved with other work by an out-of-order core.

n, x, and y can be computed in parallel, then z, then r, so the critical path is 3 instructions.

If there are spare execution units (that is, the surrounding code is sequence- or latency-bound), the cost will be lower than if the particular algorithm leaves no spare execution units.
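
To illustrate that point (my own sketch, not from the thread): a single accumulator exposes the 3-instruction critical path on every step, while several independent accumulators give an out-of-order core separate chains to interleave, assuming spare execution units are available.

#include <xmmintrin.h>

__m128 newmax(__m128 a, __m128 b); /* as defined earlier in the thread */

/* Latency-bound: each newmax waits for the previous one (assumes n is a
   nonzero multiple of 4; no tail handling). */
__m128 reduce_chained(const float *data, int n) {
    __m128 acc = _mm_loadu_ps(data);
    for (int i = 4; i < n; i += 4)
        acc = newmax(acc, _mm_loadu_ps(data + i));
    return acc;
}

/* Throughput-bound: four independent chains the core can overlap (assumes n
   is a nonzero multiple of 16; no tail handling). */
__m128 reduce_split(const float *data, int n) {
    __m128 a0 = _mm_loadu_ps(data + 0),  a1 = _mm_loadu_ps(data + 4);
    __m128 a2 = _mm_loadu_ps(data + 8),  a3 = _mm_loadu_ps(data + 12);
    for (int i = 16; i < n; i += 16) {
        a0 = newmax(a0, _mm_loadu_ps(data + i));
        a1 = newmax(a1, _mm_loadu_ps(data + i + 4));
        a2 = newmax(a2, _mm_loadu_ps(data + i + 8));
        a3 = newmax(a3, _mm_loadu_ps(data + i + 12));
    }
    return newmax(newmax(a0, a1), newmax(a2, a3));
}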

sunfishcode added a commit to sunfishcode/ecmascript_simd that referenced this issue Oct 20, 2014

Implement new min/max semantics
Following the proposal in
tc39#67

These min and max functions follow the Math.min and Math.max semantics,
and are arguably the more intuitive semantics.

This patch also introduces minNum and maxNum functions, which follow the
IEEE-754 minNum and maxNum semantics.
