floating point to integer casts can cause undefined behaviour #10184

Open
thestinger opened this Issue Oct 31, 2013 · 118 comments

Contributor

thestinger commented Oct 31, 2013

UPDATE (by @nikomatsakis): After much discussion, we've got the rudiments of a plan for how to address this problem. But we need some help with actually investigating the performance impact and working out the final details!


ORIGINAL ISSUE FOLLOWS:

If the value cannot fit in ty2, the results are undefined.

1.04E+17 as u8
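(For context: under the saturating cast semantics Rust eventually adopted in 1.45, this expression has a defined result. A minimal sketch, assuming a current rustc:)

```rust
// Sketch assuming current Rust (>= 1.45) semantics: `as` float-to-int
// casts now saturate instead of being undefined, so an out-of-range
// value clamps to the target type's bounds.
pub fn saturating_examples() -> (u8, u8) {
    (1.04E+17 as u8, (-1.0f64) as u8) // clamps to u8::MAX and u8::MIN
}
```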
Contributor

brson commented Oct 31, 2013

Nominating

Member

pnkfelix commented Nov 7, 2013

accepted for P-high, same reasoning as #10183

Contributor

pcwalton commented Nov 22, 2013

I don't think this is backwards incompatible at a language level. It will not cause code that was working OK to stop working. Nominating.

Member

pnkfelix commented Dec 19, 2013

changing to P-high, same reasoning as #10183

Contributor

nrc commented Sep 12, 2014

How do we propose to solve this and #10185? Since whether behaviour is defined or not depends on the dynamic value of the number being cast, it seems the only solution is to insert dynamic checks. We seem to agree we do not want to do that for arithmetic overflow, are we happy to do it for cast overflow?

Contributor

pcwalton commented Sep 12, 2014

We could add an intrinsic to LLVM that performs a "safe conversion". @zwarich may have other ideas.

zwarich commented Sep 12, 2014

AFAIK the only solution at the moment is to use the target-specific intrinsics. That's what JavaScriptCore does, at least according to someone I asked.

Contributor

pcwalton commented Sep 12, 2014

Oh, that's easy enough then.

Contributor

nrc commented Apr 23, 2015

ping @pnkfelix is this covered by the new overflow checking stuff?

Contributor

bluss commented May 7, 2015

These casts are not checked by rustc with debug assertions.

Contributor

Aatch commented Sep 13, 2015

I'm happy to handle this, but I need a concrete solution. I personally think that it should be checked along with overflowing integer arithmetic, as it's a very similar issue. I don't really mind what we do though.

Note that this issue is currently causing an ICE when used in certain constant expressions.

Contributor

bluss commented Sep 13, 2015

This allows violating memory safety in safe rust, example from this forum post:

Undefs, huh? Undefs are fun. They tend to propagate. After a few minutes of wrangling..

#[inline(never)]
pub fn f(ary: &[u8; 5]) -> &[u8] {
    let idx = 1e100f64 as usize;
    &ary[idx..]
}

fn main() {
    println!("{}", f(&[1; 5])[0xdeadbeef]);
}

segfaults on my system (latest nightly) with -O.

Contributor

steveklabnik commented Oct 8, 2015

Marking with I-unsound given the violation of memory safety in safe rust.

Contributor

steveklabnik commented Oct 29, 2015

@bluss, this does not segfault for me, just gives an assertion error. Untagging since I was the one who added it.

@steveklabnik steveklabnik added I-unsound and removed I-unsound labels Oct 29, 2015

Contributor

steveklabnik commented Oct 29, 2015

Sigh, I forgot the -O, re-tagging.

Contributor

nagisa commented Feb 21, 2016

re-nominating for P-high. Apparently this was at some point P-high but got lower over time. This seems pretty important for correctness.

EDIT: didn’t react to triage comment, adding label manually.

@nagisa nagisa added the I-nominated label Feb 21, 2016

Contributor

nikomatsakis commented Feb 25, 2016

It seems like the precedent from the overflow stuff (e.g. for shifting) is to just settle on some behavior. Java seems to produce the result modulo the range, which seems not unreasonable; I'm not sure just what kind of LLVM code we'd need to handle that.

Contributor

ranma42 commented Feb 26, 2016

According to https://docs.oracle.com/javase/specs/jls/se7/html/jls-5.html#jls-5.1.3 Java also guarantees that NaN values are mapped to 0 and infinities to the minimum/maximum representable integer. Moreover, the Java rule for the conversion is more complex than just wrapping: it can be a combination of saturation (for the conversion to int or long) and wrapping (for the conversion to smaller integral types, if needed).

Replicating the whole conversion algorithm from Java is certainly possible, but it would require a fair amount of operations for every cast. In particular, in order to ensure that the result of a fpto[us]i operation in LLVM does not exhibit undefined behaviour, a range check would be needed.

As an alternative, I would suggest that float->int casts are guaranteed to only be valid if the truncation of the original value can be represented as a value of the destination type (or maybe as [iu]size?) and to have assertions on debug builds that trigger a panic when the value has not been represented faithfully.

The main advantages of the Java approach are that the conversion function is total, but this also means that unexpected behaviour might creep in: it would prevent undefined behaviour, but it would be easy to be tricked into not checking if the cast actually made any sense (this is unfortunately true also for the other casts 😟 ).

The other approach matches the one currently used for arithmetic operations: simple & efficient implementation in release, panics triggered by range checking in debug. Unfortunately unlike other as casts, this would make such conversion checked, which can be surprising to the user (although maybe the analogy to arithmetic operations can help here). This would also break some code, but AFAICT it should only happen for code which is currently relying on undefined behaviour (i.e. it would replace the undefined behaviour "let's return any integer, you obviously don't care which" with a panic).
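A sketch of the Java-style algorithm just described, for f64 -> i8 (the function name and structure are illustrative, not a proposed API: saturate into the wide type first, mapping NaN to 0 and infinities to the extremes, then wrap down to the narrow type):

```rust
// Step 1: saturating conversion to the wide integer type (NaN -> 0,
// out-of-range and infinities -> i64::MIN / i64::MAX).
// Step 2: two's-complement truncation down to the small type.
fn java_f64_to_i8(x: f64) -> i8 {
    let wide: i64 = if x.is_nan() {
        0
    } else if x >= i64::MAX as f64 {
        i64::MAX
    } else if x <= i64::MIN as f64 {
        i64::MIN
    } else {
        x.trunc() as i64
    };
    wide as i8 // wrapping narrow, as Java does for byte/short
}
```

Note how this reproduces Java's occasionally surprising results, e.g. positive infinity ends up as -1 after the wrap.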

Member

retep998 commented Feb 26, 2016

The problem isn't "let's return any integer, you obviously don't care which"; it is that the cast produces an undef, which isn't a random value but rather a nasal-demon value: LLVM is allowed to assume the undef never occurs, enabling optimizations that do horribly incorrect things. If it were a random value, but crucially not undef, that would be enough to fix the soundness issues. We don't need to define how unrepresentable values are represented, we just need to prevent undef.

Contributor

nikomatsakis commented Mar 3, 2016

Discussed in @rust-lang/compiler meeting. The most consistent course of action remains:

  1. when overflow checks are enabled, check for illegal casts and panic.
  2. otherwise, we need a fallback behavior, it should be something which has minimal (ideally, zero) runtime cost for valid values, but the precise behavior is not that important, so long as it is not LLVM undef.

The main problem is that we need a concrete suggestion for option 2.

Contributor

nikomatsakis commented Mar 3, 2016

triage: P-medium

Contributor

glaebhoerl commented Mar 5, 2016

@nikomatsakis Does as ever currently panic in debug builds? If it doesn't, for consistency and predictability it seems preferable to keep it that way. (I think it should have, just like arithmetic, but that's a separate, and past, debate.)

Contributor

oli-obk commented Mar 9, 2016

otherwise, we need a fallback behavior, it should be something which has minimal (ideally, zero) runtime cost for valid values, but the precise behavior is not that important, so long as it is not LLVM undef.

Concrete suggestion: extract digits and exponent as u64 and bitshift digits by exponent.

fn f64_as_u64(f: f64) -> u64 {
    let (mantissa, exponent, _sign) = f.integer_decode();
    mantissa >> ((-exponent) & 63)
}

Yes it's not zero cost, but it's somewhat optimizable (would be better if we marked integer_decode inline) and at least deterministic. A future MIR-pass that expands a float->int cast could probably analyze whether the float is guaranteed to be ok to cast and skip this heavy conversion.

Member

eddyb commented Mar 9, 2016

Does LLVM not have platform intrinsics for the conversion functions?

EDIT: @zwarich said (a long time ago):

AFAIK the only solution at the moment is to use the target-specific intrinsics. That's what JavaScriptCore does, at least according to someone I asked.

Why even bother panicking? AFAIK, @glaebhoerl is correct, as is supposed to truncate/extend, not check the operands.

Contributor

nikomatsakis commented Mar 10, 2016

On Sat, Mar 05, 2016 at 03:47:55AM -0800, Gábor Lehel wrote:

@nikomatsakis Does as ever currently panic in debug builds? If it doesn't, for consistency and predictability it seems preferable to keep it that way. (I think it should have, just like arithmetic, but that's a separate, and past, debate.)

True. I find that persuasive.

Contributor

nikomatsakis commented Mar 10, 2016

On Wed, Mar 09, 2016 at 02:31:05AM -0800, Eduard-Mihai Burtescu wrote:

Does LLVM not have platform intrinsics for the conversion functions?

EDIT:

AFAIK the only solution at the moment is to use the target-specific intrinsics. That's what JavaScriptCore does, at least according to someone I asked.

Why even bother panicking? AFAIK, @glaebhoerl is correct, as is supposed to truncate/extend, not check the operands.

Yes, I think I was mistaken before. as is the "unchecked truncation"
operator, for better or worse, and it seems best to stay consistent
with that philosophy. Using target-specific intrinsics may be a perfectly
fine solution though?

Contributor

sanmai-NL commented Jun 17, 2016

@nikomatsakis: it seems the behavior hasn't been defined yet? Can you give an update about the planning regarding that?

gmorenz commented Aug 21, 2016

Just ran into this with much smaller numbers

    let x: f64 = -1.0;
    x as u8

Results in 0, 16, etc. depending on optimizations. I was hoping it would be defined as 255 so I don't have to write x as i16 as u8.
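A minimal sketch of both behaviours under the saturating semantics Rust later adopted (assumption: current rustc; at the time of this comment the direct cast was undefined):

```rust
// The direct cast saturates on current Rust, while going through i16
// first truncates to -1 and then wraps to 255, the behaviour wanted here.
pub fn negative_one_casts() -> (u8, u8) {
    let x: f64 = -1.0;
    (x as u8, x as i16 as u8)
}
```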

Contributor

vks commented Aug 24, 2016

@gmorenz Did you try !0u8?

gmorenz commented Aug 24, 2016

In context that wouldn't make sense, I was getting the f64 from a transformation on data sent over the network, with a range of [-255, 255]. I was hoping it would wrap nicely (in the exact way that <i32> as u8 wraps).

Contributor

bstrie commented Oct 19, 2016

Here's a recent LLVM proposal to "kill undef" http://lists.llvm.org/pipermail/llvm-dev/2016-October/106182.html , though I'm hardly knowledgeable enough to know whether or not this would automagically resolve this issue.

Contributor

notriddle commented Nov 27, 2016

They're replacing undef with poison, the semantics being slightly different. It's not going to make float -> int casts defined behavior.

Contributor

nagisa commented Dec 2, 2016

We probably should provide some explicit way to do a saturating cast? I’ve wanted that exact behaviour just now.

Contributor

jorendorff commented Dec 10, 2016

Seems like this should be marked I-crash, given #10184 (comment) .

Contributor

steveklabnik commented Feb 25, 2017

We had a question about this in #rust-beginners today, someone ran into it in the wild.

Contributor

jorendorff commented May 3, 2017

The book I'm writing with @jimblandy, Programming Rust, mentions this bug.

Several kinds of casts are permitted.

  • Numbers may be cast from any of the built-in numeric types to any other.

    (...)

    However, as of this writing, casting a large floating-point value to an integer type that is too small to represent it can lead to undefined behavior. This can cause crashes even in safe Rust. It is a bug in the compiler, github.com/rust-lang/rust/issues/10184.

Our deadline for this chapter is May 19. I'd love to delete that last paragraph, but I feel like we should at least have some kind of plan here first.

Apparently current JavaScriptCore uses an interesting hack on x86. They use the CVTTSD2SI instruction, then fall back on some hairy C++ if the value is out of range. Since out-of-range values currently explode, using that instruction (with no fallback!) would be an improvement on what we have now, albeit only for one architecture.

Contributor

vks commented May 3, 2017

Honestly I think we should deprecate numeric casts with as and use From and TryFrom or something like the conv crate instead.

Contributor

jorendorff commented May 3, 2017

Maybe so, but that seems orthogonal to me.

Contributor

nikomatsakis commented May 4, 2017

OK, I've just re-read this whole conversation. I think there is agreement that this operation should not panic (for general consistency with as). There are two leading contenders for what the behavior ought to be:

  • Some sort of defined result
    • Pro: This I think is maximally consistent with our general philosophy thus far.
    • Con: There doesn't seem to be a truly portable way to produce any particular defined result in this case. This implies that we would be using platform-specific intrinsics with some sort of fallback for out-of-range values (for example, falling back to saturation, to the function that @oli-obk proposed, to Java's definition, or to whatever "hairy C++" JSC uses).
    • At worst, we can just insert some ifs for the "out of range" cases.
  • An undefined value (not undefined behavior)
    • Pro: This lets us just use the platform-specific intrinsics that are available on each platform.
    • Con: It's a portability hazard. In general, I feel like we've not made use of undefined results very often, at least in the language (I'm sure we do in the libs in various places).

It's not clear to me if there is a clear precedent for what the result ought to be in the first case?
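For concreteness, a minimal sketch of the "defined result" option with a saturating fallback, for f32 -> i32 (illustrative only; the function name and exact boundary handling are assumptions, not a settled design):

```rust
fn f32_to_i32_sat(x: f32) -> i32 {
    if x.is_nan() {
        0 // NaN has no meaningful integer value; pick 0
    } else if x >= 2147483648.0 {
        i32::MAX // 2^31 is the smallest f32 above i32::MAX
    } else if x < -2147483648.0 {
        i32::MIN // -2^31 == i32::MIN is exactly representable
    } else {
        x as i32 // in range: plain truncation toward zero
    }
}
```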

Contributor

nikomatsakis commented May 4, 2017

After having written that out, my preference would be to maintain a deterministic result. I feel like every place that we can hold the line on determinism is a win. I am not really sure what the result ought to be though.

I like saturation because I can understand it and it seems useful, but it seems somehow incongruent with the way that u64 as u32 does truncation. So perhaps some sort of result based on truncation makes sense, which I guess is probably what @oli-obk proposed -- I don't fully understand what that code is intended to do. =)

Contributor

oli-obk commented May 4, 2017

My code gives the correct value for things in the range 0..2^64 and deterministic but bogus values for everything else.

floats are represented as mantissa × 2^exponent, e.g. 1.0 is (1 << 52) × 2^-52, and since bitshifts and powers of two are the same thing in binary, we can just reverse the shift (thus the negation of the exponent and the right shift).
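The decomposition being described can also be written with stable bit operations; a hypothetical stand-in for integer_decode (sketch: ignores sign and assumes a normal, finite, non-zero input):

```rust
// Decode an f64 into (mantissa, exponent) such that value = mantissa * 2^exponent.
fn decode(f: f64) -> (u64, i16) {
    let bits = f.to_bits();
    let biased_exp = ((bits >> 52) & 0x7ff) as i16;
    // Restore the implicit leading 1 of a normal float's mantissa.
    let mantissa = (bits & ((1u64 << 52) - 1)) | (1u64 << 52);
    // 1075 = exponent bias (1023) + mantissa width (52).
    (mantissa, biased_exp - 1075)
}
```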

Contributor

nikomatsakis commented Sep 22, 2017

I marked this as E-needs-mentor and tagged it with WG-compiler-middle since it seems the impl period might be a great time to investigate this further! My existing notes on the plan of record are pretty sparse though, so it'd be great if someone from @rust-lang/compiler wanted to help elaborate those a bit further!

Contributor

arielb1 commented Sep 24, 2017

@nikomatsakis

IIRC LLVM are planning to eventually implement freeze, which should allow us to deal with the UB by doing a freeze.

s3bk commented Sep 24, 2017

My results so far: https://gist.github.com/s3bk/4bdfbe2acca30fcf587006ebb4811744

The _array variants run a loop of 1024 values.
_cast: x as i32
_clip: x.min(MAX).max(MIN) as i32
_panic: panics if x is out of bounds
_zero: sets the result to zero if out of bounds

test bench_array_cast       ... bench:       1,840 ns/iter (+/- 37)
test bench_array_cast_clip  ... bench:       2,657 ns/iter (+/- 13)
test bench_array_cast_panic ... bench:       2,397 ns/iter (+/- 20)
test bench_array_cast_zero  ... bench:       2,671 ns/iter (+/- 19)
test bench_cast             ... bench:           2 ns/iter (+/- 0)
test bench_cast_clip        ... bench:           2 ns/iter (+/- 0)
test bench_cast_panic       ... bench:           2 ns/iter (+/- 0)
test bench_cast_zero        ... bench:           2 ns/iter (+/- 0)

sp-1234 commented Sep 25, 2017

Perhaps the results should not be rounded to whole nanoseconds for individual operations. Clearly there must be some difference hiding behind these 2 ns/iter figures. Or is it really exactly 2 ns for all 4 variants?

Member

eddyb commented Sep 25, 2017

@sp-1234 I wonder if it's partially optimized out.

s3bk commented Sep 25, 2017

@sp-1234 It is too fast to measure. The non-array benchmarks are basically useless.
If you force the single-value operations into actual function calls via #[inline(never)], you get 2 ns vs 3 ns.

Contributor

Amanieu commented Sep 25, 2017

@arielb1
I have some reservations regarding freeze. If I understand correctly, a frozen undef can still contain any arbitrary value, it just won't change between uses. In practice, the compiler will probably reuse a register or stack slot.

However this means that we can now read uninitialized memory from safe code. This could lead to secret data being leaked, somewhat like Heartbleed. It's debatable whether this is truly considered UB from the point of view of Rust, but it clearly seems undesirable.

Contributor

rkruppe commented Sep 26, 2017

I ran @s3bk's benchmark locally. I can confirm the scalar versions are optimized out completely, and the asm for the array variants also looks suspiciously well-optimized: for example, the loops are vectorized, which is nice but makes it hard to extrapolate performance to scalar code.

Unfortunately, spamming black_box doesn't seem to help. I do see the asm doing useful work, but running the benchmark still consistently gives 0ns for the scalar benchmarks (except cast_zero, which shows 1ns). I see that @alexcrichton performed the comparison 100 times in their benchmarks, so I adopted the same hack. I'm now seeing these numbers (source code):

test bench_cast             ... bench:          53 ns/iter (+/- 0)
test bench_cast_clip        ... bench:         164 ns/iter (+/- 1)
test bench_cast_panic       ... bench:         172 ns/iter (+/- 2)
test bench_cast_zero        ... bench:         100 ns/iter (+/- 0)

The array benchmarks vary too much for me to trust them. Well, truth be told, I'm skeptical of the test benchmarking infrastructure anyway, especially after seeing the above numbers compared to the flat 0 ns I got previously. Furthermore, even just 100 iterations of black_box(x); (as a baseline) takes 34 ns, which makes it even harder to reliably interpret those numbers.

Two points worth noting:

  • Despite not handling NaN specifically (it returns -inf instead of 0?), the cast_clip implementation appears to be slower than @alexcrichton's saturating cast (note that their run and mine have roughly the same timing for as casts, 53-54ns).
  • Unlike @s3bk's array results, I'm seeing cast_panic being slower than the other checked casts. I also see an even greater slowdown on the array benchmarks. Maybe these things are just highly dependent on microarchitectural details and/or optimizer mood?

For the record, I've measured with rustc 1.21.0-nightly (d692a91 2017-08-04), -C opt-level=3, on an i7-6700K under light load.


In conclusion, we don't have reliable data so far, and getting more reliable data seems hard. Furthermore, I strongly doubt that any real application spends even 1% of its wall-clock time on this operation. Therefore, I would suggest moving forward by implementing saturating as casts in rustc, behind a -Z flag, and then running some non-artificial benchmarks with and without this flag to determine the impact on realistic applications.

Edit: I would also recommend to run such benchmarks on a variety of architectures (e.g., including ARM) and microarchitectures, if at all possible.

I admit I'm not that familiar with rust, but I think this line is subtly incorrect: std::i32::MAX (2^31-1) is not exactly representable as a Float32, so std::i32::MAX as f32 will be rounded to the nearest representable value (2^31). If this value is used as the argument x, the result is technically undefined. Replacing with a strict inequality should fix this case.
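This rounding can be checked directly; a small sketch:

```rust
// i32::MAX = 2^31 - 1 is not exactly representable in f32: the nearest
// f32 is 2^31, which lies outside i32's range. A clamp written with
// `x <= i32::MAX as f32` therefore admits 2^31 itself; a strict `<` does not.
pub fn i32_max_rounds_up_in_f32() -> bool {
    (i32::MAX as f32) as f64 == 2147483648.0
}
```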

Member

Manishearth commented Sep 27, 2017

Yeah, we had exactly that problem in Servo before. The final solution was to cast to f64 and then clamp.

There are other solutions but they're pretty tricky and rust doesn't expose nice APIs for dealing with this well.

s3bk commented Sep 27, 2017

using 0x7FFF_FF80i32 as upper limit and -0x8000_0000i32 should solve this without casting to f64.
edit: use the correct value.

I think you mean 0x7fff_ff80, but simply using a strict inequality would probably make the intention of the code clearer.

s3bk commented Sep 27, 2017

as in x < 0x8000_0000u32 as f32 ? That would probably be a good idea.
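Both suggested bounds round-trip exactly through f32, which a quick sketch can verify:

```rust
// 0x7FFF_FF80 = 2^7 * (2^24 - 1) fits in f32's 24-bit mantissa,
// and -2^31 is a power of two, so both are exactly representable.
pub fn clamp_bounds_are_exact() -> bool {
    (0x7FFF_FF80i32 as f32 as i32 == 0x7FFF_FF80)
        && (i32::MIN as f32 as i64 == i32::MIN as i64)
}
```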

ActuallyaDeviloper commented Sep 28, 2017

I think that of all the suggested deterministic options, clamping is generally the most useful one, because I think it's done often anyway. If the conversion were actually documented to be saturating, manual clamping would become unnecessary.

I am only a little worried about the suggested implementation because it doesn't translate directly to machine instructions and it relies heavily on branching. Branching makes the performance dependent on specific data patterns. In the test cases given above everything looks (comparatively) fast because the same branch is always taken and the processor has good branch prediction data from many previous loop iterations. The real world will probably not look like that. Additionally, branching hurts the compiler's ability to vectorize the code. I disagree with @rkruppe's opinion that the operation shouldn't also be tested in combination with vectorization. Vectorization is important in high-performance code, and being able to vectorize simple casts on common architectures should be a crucial requirement.

For the reasons given above, I played around with an alternative branchless, data-flow-oriented version of @alexcrichton's cast with saturation semantics and @simonbyrne's fix. I implemented it for u16, i16 and i32, since they all have to cover slightly different cases, which results in varying performance.

The results:

test i16_bench_array_cast       ... bench:          99 ns/iter (+/- 2)
test i16_bench_array_cast_clip  ... bench:         197 ns/iter (+/- 3)
test i16_bench_array_cast_clip2 ... bench:         113 ns/iter (+/- 3)
test i16_bench_cast             ... bench:          76 ns/iter (+/- 1)
test i16_bench_cast_clip        ... bench:         218 ns/iter (+/- 25)
test i16_bench_cast_clip2       ... bench:         148 ns/iter (+/- 4)
test i16_bench_rng_cast         ... bench:       1,181 ns/iter (+/- 17)
test i16_bench_rng_cast_clip    ... bench:       1,952 ns/iter (+/- 27)
test i16_bench_rng_cast_clip2   ... bench:       1,287 ns/iter (+/- 19)

test i32_bench_array_cast       ... bench:         114 ns/iter (+/- 1)
test i32_bench_array_cast_clip  ... bench:         200 ns/iter (+/- 3)
test i32_bench_array_cast_clip2 ... bench:         128 ns/iter (+/- 3)
test i32_bench_cast             ... bench:          74 ns/iter (+/- 1)
test i32_bench_cast_clip        ... bench:         168 ns/iter (+/- 3)
test i32_bench_cast_clip2       ... bench:         189 ns/iter (+/- 3)
test i32_bench_rng_cast         ... bench:       1,184 ns/iter (+/- 13)
test i32_bench_rng_cast_clip    ... bench:       2,398 ns/iter (+/- 41)
test i32_bench_rng_cast_clip2   ... bench:       1,349 ns/iter (+/- 19)

test u16_bench_array_cast       ... bench:          99 ns/iter (+/- 1)
test u16_bench_array_cast_clip  ... bench:         136 ns/iter (+/- 3)
test u16_bench_array_cast_clip2 ... bench:         105 ns/iter (+/- 3)
test u16_bench_cast             ... bench:          76 ns/iter (+/- 2)
test u16_bench_cast_clip        ... bench:         184 ns/iter (+/- 7)
test u16_bench_cast_clip2       ... bench:         110 ns/iter (+/- 0)
test u16_bench_rng_cast         ... bench:       1,178 ns/iter (+/- 22)
test u16_bench_rng_cast_clip    ... bench:       1,336 ns/iter (+/- 26)
test u16_bench_rng_cast_clip2   ... bench:       1,207 ns/iter (+/- 21)

The test was run on an Intel Haswell i5-4570 CPU and Rust 1.22.0-nightly.
clip2 is the new branchless implementation. It agrees with clip on all 2^32 possible f32 input values.

For the rng benchmarks, random input values are used that hit the different cases often. This uncovers the extreme performance cost (roughly 10 times the normal cost!) that occurs when branch prediction fails. I think it is very important to consider this. It's not the average real-world performance either, but it's still a possible case, and some applications will hit it. People will expect an f32 cast to have consistent performance.

Assembly comparison on x86: https://godbolt.org/g/AhdF71
The branchless version maps very nicely to the minss/maxss instructions.

Unfortunately I wasn't able to make godbolt generate ARM assembly from Rust, but here is an ARM comparison of the methods with Clang: https://godbolt.org/g/s7ronw
Without being able to test the code or knowing much about ARM: the code size seems smaller too, and LLVM mostly generates vmax/vmin, which looks promising. Maybe LLVM could eventually be taught to fold most of the code into a single instruction?

Contributor

rkruppe commented Sep 28, 2017

@ActuallyaDeviloper The asm and the benchmark results look very good! Furthermore, branchless code like yours is probably easier to generate in rustc than the nested conditionals of other solutions (for the record, I am assuming we want to generate inline IR instead of calling a lang item function). Thank you so much for writing this.

I have a question about u16_cast_clip2: it doesn't seem to handle NaN?! There is a comment talking about NaN, but I believe the function will pass NaN through unmodified and attempt to cast it to u16 (and even if it didn't, it would produce one of the boundary values rather than 0).

PS: To be clear, I was not trying to imply that it's unimportant whether the cast can be vectorized. It clearly is important if the surrounding code is otherwise vectorizable. But scalar performance is also important, as vectorization is often not applicable, and the benchmarks I was commenting on were not making any statement about scalar performance. Out of interest, have you checked the asm of the *array* benchmarks to see if they're still vectorized with your implementation?

@rkruppe You are right, I accidentally swapped the sides of the if and forgot about that. f32 as u16 happened to do the right thing by truncating the upper 0x8000 away, so the tests didn't catch it either. I fixed the problem now by swapping the branches again and testing all methods with if (y.is_nan()) { panic!("NaN"); } this time.

I updated my previous post. The x86 code did not change significantly at all, but unfortunately the change stops LLVM from generating vmax in the u16 ARM case for some reason. I assume this has to do with some details of that ARM instruction's NaN handling, or maybe it's an LLVM limitation.

For why it works, notice that the lower boundary value is actually 0 for unsigned values. So NaN and the lower bound can be caught at the same time.

The array versions are vectorized.
Godbolt: https://godbolt.org/g/HnmsSV
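A minimal sketch of that trick (illustrative, not the actual benchmarked implementation): clamping into [0, 65535] with f32::max/f32::min is branchless, and because f32::max returns the non-NaN operand, the lower clamp maps NaN to 0 as a side effect:

```rust
// Branchless-style saturating f32 -> u16 cast. x.max(0.0) maps NaN to
// 0.0 (f32::max returns the other operand when one is NaN), so the
// lower clamp handles NaN and negative values at once.
fn f32_to_u16_sat(x: f32) -> u16 {
    let y = x.max(0.0).min(65535.0);
    y as u16 // clamped value is always in range
}
```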

Contributor

rkruppe commented Sep 29, 2017

Re: the ARM asm, I believe the reason vmax is no longer used is that it returns NaN if either operand is NaN. The code is still branchless, though; it just uses predicated moves (vmovgt, referring to the result of the earlier vcmp with 0).

For why it works, notice that the lower boundary value is actually 0 for unsigned values. So NaN and the lower bound can be caught at the same time.

Ohhh, right. Nice.

Contributor

rkruppe commented Oct 8, 2017

I would suggest to move forward by implementing saturating as casts in rustc, behind a -Z flag

I have implemented this and will file a PR once I also fixed #41799 and have a lot more tests.

@bstrie bstrie added the A-LLVM label Oct 9, 2017

Contributor

rkruppe commented Oct 9, 2017

#45134 has pointed out a code path that I missed (generation of LLVM constant expressions – this is separate from rustc's own constant evaluation). I'll roll a fix for that into the same PR, but it will take a little while longer.

Member

eddyb commented Oct 10, 2017

@rkruppe You should coordinate with @oli-obk so that miri ends up with the same changes.

Contributor

rkruppe commented Oct 11, 2017

Pull request is up: #45205

bors added a commit that referenced this issue Oct 18, 2017

Auto merge of #45205 - rkruppe:saturating-casts, r=<try>
Saturating casts between integers and floats

Introduces a new flag, `-Z saturating-float-casts`, which makes code generation for int->float and float->int casts safe (`undef`-free), implementing [the saturating semantics laid out by](#10184 (comment)) @jorendorff for float->int casts and overflowing to infinity for `u128::MAX` -> `f32`. Constant evaluation in trans also behaves the same way &ndash; even without the -Z flag, because it doesn't affect run time or type checking.
The HIR constant evaluator is unaffected because it currently works correctly for u128->f32 and returns an error for problematic float->int casts. That error is conservatively correct and we should be careful not to accept more code in constant expressions.

Many thanks to @eddyb, whose APFloat port simplified many parts of this patch, and made HIR constant evaluation recognize dangerous float casts as mentioned above.
Also thanks to @ActuallyaDeviloper whose branchless implementation served as inspiration for this implementation.

cc #10184 #41799
fixes #45134

bors added a commit that referenced this issue Nov 7, 2017

Auto merge of #45205 - rkruppe:saturating-casts, r=eddyb
Saturating casts between integers and floats

Introduces a new flag, `-Z saturating-float-casts`, which makes code generation for int->float and float->int casts safe (`undef`-free), implementing [the saturating semantics laid out by](#10184 (comment)) @jorendorff for float->int casts and overflowing to infinity for `u128::MAX` -> `f32`.
Constant evaluation in trans was changed to behave like HIR const eval already did, i.e., saturate for u128->f32 and report an error for problematic float->int casts.

Many thanks to @eddyb, whose APFloat port simplified many parts of this patch, and made HIR constant evaluation recognize dangerous float casts as mentioned above.
Also thanks to @ActuallyaDeviloper whose branchless implementation served as inspiration for this implementation.

cc #10184 #41799
fixes #45134

bors added a commit that referenced this issue Nov 8, 2017

Auto merge of #45205 - rkruppe:saturating-casts, r=eddyb
Saturating casts between integers and floats

Introduces a new flag, `-Z saturating-float-casts`, which makes code generation for int->float and float->int casts safe (`undef`-free), implementing [the saturating semantics laid out by](#10184 (comment)) @jorendorff for float->int casts and overflowing to infinity for `u128::MAX` -> `f32`.
Constant evaluation in trans was changed to behave like HIR const eval already did, i.e., saturate for u128->f32 and report an error for problematic float->int casts.

Many thanks to @eddyb, whose APFloat port simplified many parts of this patch, and made HIR constant evaluation recognize dangerous float casts as mentioned above.
Also thanks to @ActuallyaDeviloper whose branchless implementation served as inspiration for this implementation.

cc #10184 #41799
fixes #45134
Contributor

rkruppe commented Nov 8, 2017

#45205 has been merged, so anyone can now (well, starting with the next nightly) measure the performance impact of saturation by passing -Z saturating-float-casts via RUSTFLAGS. [1] Such measurements would be very valuable for deciding how to proceed with this issue.

[1] Strictly speaking, this won't affect the non-generic, non-#[inline] portions of the standard library, so to be 100% accurate you'd want to locally build std with Xargo. However, I don't expect that there will be a lot of code affected by this (the various conversion trait impls are #[inline], for example).

Contributor

Gankro commented Nov 8, 2017

@rkruppe I suggest starting an internals/users page to collect data, in the same vein as https://internals.rust-lang.org/t/help-us-benchmark-incremental-compilation/6153/ (we can then also link people to that, rather than some random comments in our issue tracker)

Contributor

est31 commented Nov 8, 2017

@rkruppe you should create a tracking issue. This discussion is split up into two issues already. That's not good!

Contributor

rkruppe commented Nov 9, 2017

@Gankro Yeah I agree, but it may be a few days before I find the time to write that post properly, so I figured I'd solicit feedback from the people subscribed to this issue in the mean time.

@est31 Hmm. Although the -Z flag covers both cast directions (which may have been a mistake, in retrospect), it seems unlikely that we'll flip the switch on both at the same time, and there's little overlap between the two in terms of what must be discussed (e.g., this issue hinges on the performance of saturation, while in #41799 it's agreed upon what the right solution is).
It is a bit silly that benchmarks primarily targeted at this issue would also measure the impact of the fix to #41799, but that can at most lead to overreporting of performance regressions, so I'm sort of okay with that. (But if anyone is motivated to split the -Z flag into two, go ahead.)

I've considered a tracking issue for the task of removing the flag once it has outlived its usefulness, but I don't see the need to merge the discussions occurring here and in #41799.

Contributor

Gankro commented Nov 9, 2017

I have drafted up an internals post: https://gist.github.com/Gankro/feab9fb0c42881984caf93c7ad494ebd

Feel free to copy that, or just give me notes so I can post it. (note I'm a bit confused about the const fn behaviour)

Contributor

sunfishcode commented Nov 9, 2017

One additional tidbit is that the cost of float->int conversions is specific to the current implementation, rather than being fundamental. On x86, cvttss2si returns 0x80000000 in the too-low, too-high, and NaN cases, so one could implement -Zsaturating-float-casts with a cvttss2si followed by special code in the 0x80000000 case, so it could be just a single compare-and-predictable-branch in the common case. On ARM, vcvt.s32.f32 has the -Zsaturating-float-casts semantics already. LLVM doesn't currently optimize the extra checks away in either case.
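The fixup described here can be sketched in Rust (function name is mine, not from the thread; it assumes `raw` came from a cvttss2si-style hardware conversion that yields i32::MIN, i.e. 0x80000000, for NaN and out-of-range inputs):

```rust
// Sketch: post-conversion fixup for a hardware cvttss2si result.
// Only the rare 0x80000000 case takes the extra (predictable) branch.
fn fixup_after_cvttss2si(input: f32, raw: i32) -> i32 {
    if raw != i32::MIN {
        return raw; // common case: hardware result was in range
    }
    if input.is_nan() {
        0
    } else if input < 0.0 {
        i32::MIN // also covers the legitimate conversion of -2^31 itself
    } else {
        i32::MAX
    }
}
```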

Contributor

rkruppe commented Nov 9, 2017

@Gankro

Awesome, thanks a lot! I left a few notes on the gist. After reading this, I'd like to take a stab at separating u128->f32 casts from the -Z flag. Just for the sake of getting rid of the distracting caveat about the flag covering two orthogonal features.

Contributor

rkruppe commented Nov 9, 2017

(I've filed #45900 to refocus the -Z flag so that it only covers the float->int issue)

Contributor

comex commented Nov 10, 2017

It would be nice if we could get platform-specific implementations a la @sunfishcode (at least for x86) before asking for mass benchmarking. It shouldn't be very difficult.

Contributor

sunfishcode commented Nov 10, 2017

The problem is that LLVM doesn't currently provide a way to do this, as far as I know, except maybe with inline asm which I wouldn't necessarily recommend for a release.

Contributor

Gankro commented Nov 10, 2017

I have updated the draft to reflect discussion (basically ripping out any inline mention of u128 -> f32 to an extra section at the end).

Contributor

comex commented Nov 10, 2017

@sunfishcode Are you sure? Isn't the llvm.x86.sse.cvttss2si intrinsic what you're looking for?

Here is a playground link that uses it:

https://play.rust-lang.org/?gist=33cf9e0871df2eb2475b845af4f1b574&version=nightly

In release mode, float_to_int_with_intrinsic and float_to_int_with_as both compile down to a single instruction. (In debug mode, float_to_int_with_intrinsic wastes a few instructions putting zero into the high bits, but it's not too bad.)

It even seems to do constant folding correctly. For example,

float_to_int_with_intrinsic(42.0)

becomes

movl	$42, %eax

But an out-of-range value,

float_to_int_with_intrinsic(42.0e33)

does not get folded:

cvttss2si	.LCPI2_0(%rip), %eax

(Ideally it would fold to constant 0x80000000, but that's no big deal. The important thing is that it doesn't produce undef.)
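On current Rust, a similar experiment can be run with the stable `std::arch` SSE intrinsics instead of the raw LLVM intrinsic (a sketch; x86-64 only, since `std::arch::x86_64` exists only on that target):

```rust
// x86-64 only: drives the SSE truncating-convert instruction directly.
use std::arch::x86_64::{_mm_cvttss_si32, _mm_set_ss};

fn float_to_int_with_intrinsic(f: f32) -> i32 {
    // SAFETY: SSE is baseline on x86-64; cvttss2si is defined for all
    // inputs and yields 0x80000000 (i32::MIN) for NaN and out-of-range.
    unsafe { _mm_cvttss_si32(_mm_set_ss(f)) }
}
```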

Contributor

sunfishcode commented Nov 10, 2017

Oh, cool. It looks like that would work!

Contributor

rkruppe commented Nov 10, 2017

It's cool to know that we do, after all, have a way to build on cvttss2si. However, I do not agree that it's clearly better to change the implementation to use it before we call for benchmarks:

Most people will benchmark on x86, so if we special case x86, we'll get far less data on the general implementation, which will still be used on most other targets. Admittedly, it's already difficult to infer anything about other architectures, but a wholly different implementation makes it outright impossible.

Second, if we collect benchmarks now, with the "simple" solution, and find that there are no performance regressions in real code (and tbh that's what I expect), then we don't even need to go through the trouble of trying to optimize this code path further.

Finally, I'm not even sure building on cvttss2si will be faster than what we have now (though on ARM, just using the appropriate instruction is clearly better):

  • You need a compare to notice that the conversion returns 0x80000000, and if that's the case you still need another compare (of the input value) to know whether you should return int::MIN or int::MAX. And if it's a signed integer type, I don't see how to avoid a third compare to distinguish NaN. So in the worst case:
    • you don't save in the number of compares/selects
    • you're trading a float compare for an int compare, which might be nice for OoO cores (if you're bottlenecked on FUs that can do compares, which seems like a relatively big if), but that compare is also dependent on the float->int comparison, while the comparisons in the current implementation are all independent, so it's far from obvious that this is a win.
  • Vectorization probably becomes more difficult or impossible. I don't expect that the loop vectorizer handles this intrinsic at all.
  • It's also worth noting that (AFAIK) this strategy only applies to some integer types. f32 -> u8, for example, will need additional fixups of the result, which make this strategy pretty clearly unprofitable. I'm not quite sure which types are affected by this (e.g., I don't know if there's an instruction for f32 -> u32), but an application that uses only those types won't benefit at all.
  • You could do a branching solution with only one compare in the happy path (as opposed to two or three compares, and thus branches, as previous solutions did). However, as @ActuallyaDeviloper argued earlier, branching may not be desirable: Performance now becomes even more workload-dependent and branch-prediction-dependent.
Contributor

bstrie commented Nov 11, 2017

Is it safe to assume that we're going to need a slew of unsafe fn as_u32_unchecked(self) -> u32 and friends regardless of what the benchmarking shows? What other potential recourse would someone have if they did end up observing a slowdown?

ssokolow commented Nov 11, 2017

@bstrie I think it'd make more sense, in a case like that, to do something like extending the syntax to as <type> [unchecked] and requiring the unchecked only be present in unsafe contexts.

As I see it, a forest of _unchecked functions as variants of as casting would be a wart, both as far as intuitiveness goes, and when it comes to generating clean, usable documentation.

Contributor

bstrie commented Nov 11, 2017

@ssokolow Adding syntax should always be a last resort, especially if all of this can be taken care of with just ten rote functions. Even having a generic foo.as_unchecked::<u32>() would be preferable to syntactic changes (and the concomitant interminable bikeshed), especially since we ought to be reducing, not increasing, the number of things that unsafe unlocks.

ssokolow commented Nov 11, 2017

Point. The turbofish slipped my mind when considering options and, in hindsight, I'm not exactly firing on all cylinders this evening either, so I should have been more cautious about commenting on design decisions.

That said, it feels wrong to bake the destination type into the function name... inelegant and a potential burden on future evolution of the language. The turbofish feels like a better option.

Contributor

SimonSapin commented Nov 11, 2017

A generic method could be supported by a new set of UncheckedFrom / UncheckedInto traits with unsafe fn methods, joining the From / Into and TryFrom / TryInto collection.
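A minimal sketch of what such traits could look like (hypothetical: none of these traits or impls exist in std; names and the `debug_assert` contract are mine):

```rust
// Hypothetical UncheckedFrom/UncheckedInto pair, mirroring From/Into.
pub trait UncheckedFrom<T>: Sized {
    /// # Safety
    /// The caller must guarantee `value` is in range for `Self`.
    unsafe fn unchecked_from(value: T) -> Self;
}

pub trait UncheckedInto<U> {
    /// # Safety
    /// Same contract as `UncheckedFrom::unchecked_from`.
    unsafe fn unchecked_into(self) -> U;
}

// Blanket impl, the same way Into is derived from From.
impl<T, U: UncheckedFrom<T>> UncheckedInto<U> for T {
    unsafe fn unchecked_into(self) -> U {
        U::unchecked_from(self)
    }
}

// One example impl; the debug_assert documents the range contract.
impl UncheckedFrom<f32> for u32 {
    unsafe fn unchecked_from(value: f32) -> u32 {
        debug_assert!(value >= 0.0 && value < 4_294_967_296.0);
        value as u32
    }
}
```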

Contributor

rkruppe commented Nov 11, 2017

@bstrie One alternative solution for people whose code got slower could be to use an intrinsic (e.g., via stdsimd) to access the underlying hardware instruction. I argued earlier that this has downsides for the optimizer – auto-vectorization likely suffers, and LLVM can't exploit it returning undef on out-of-range inputs – but it does offer a way to do the cast without any extra work at run time. I can't decide if this is good enough, but it seems at least plausible that it might be.

Contributor

comex commented Nov 12, 2017

Some notes on conversions in the x86 instruction set:

SSE2 is actually relatively limited in which conversion operations it gives you. You have:

  • CVTTSS2SI family with 32-bit register: converts single float to i32
  • CVTTSS2SI family with 64-bit register: converts single float to i64 (x86-64 only)
  • CVTTPS2PI family: converts two floats to two i32s

Each of those has variants for f32 and f64 (as well as variants that round instead of truncating, but that's useless here).

But there is nothing for unsigned integers, nothing for sizes smaller than 32, and if you're on 32-bit x86, nothing for 64-bit. Later instruction set extensions add more functionality, but it seems barely anybody compiles for those.

As a result, the existing ('unsafe') behavior:

  • To convert to u32, compilers convert to i64 and truncate the resulting integer. (This produces odd behavior for out-of-range values, but that's UB so who cares.)
  • To convert to anything 16-bit or 8-bit, compilers convert to i64 or i32 and truncate the resulting integer.
  • To convert to u64, compilers generate a morass of instructions. For f32 to u64 GCC and LLVM generate an equivalent of:
fn f32_to_u64(f: f32) -> u64 {
    const CUTOFF: f32 = 0x8000000000000000 as f32; // 2^63 exactly
    if !(f >= CUTOFF) { // less, or NaN
        // just use the signed conversion
        f as i64 as u64
    } else {
        0x8000000000000000u64 + ((f - CUTOFF) as i64 as u64)
    }
}

Unrelated fun fact: "Convert-then-truncate" code generation is what causes the "parallel universes" glitch in Super Mario 64. The collision detection code first uses a MIPS instruction to convert f32 coordinates to i32, then truncates to i16; thus, coordinates that fit in i32 but not i16 'wrap', e.g. going to coordinate 65536.0 gets you collision detection for 0.0.
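With today's defined `as` semantics for the float->int step, the convert-then-truncate wrap is easy to reproduce in safe Rust:

```rust
fn main() {
    // Convert f32 -> i32, then truncate to i16, as described above.
    let coord = 65536.0_f32;
    let wrapped = (coord as i32) as i16; // 0x10000 truncates to 0
    assert_eq!(wrapped, 0);
    // A coordinate that fits in i16 survives unchanged:
    assert_eq!((100.0_f32 as i32) as i16, 100);
}
```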

Anyway, conclusions:

  • "Test for 0x80000000 and have special handler" only works for conversions to i32 and i64.

  • For conversions to u32, u/i16, and u/i8, however, "test if truncated/sign-extended output differs from original" is an equivalent. (This would scoop up both integers that were in range for the original conversion but out of range for the final type, and 0x8000000000000000, the indicator that the float was NaN or out of range for the original conversion.)

  • But the cost of a branch and a bunch of extra code for that case is probably overkill. It may be OK if branches can be avoided.

  • @ActuallyaDeviloper's minss/maxss based approach is not so bad! The minimal form,

minss %xmm2, %xmm1
maxss %xmm3, %xmm1
cvttss2si %xmm1, %rax

is only three instructions (which have decent code size and throughput/latency) and no branches.

However:

  • The pure-Rust version needs an extra test for NaN. For conversions to 32-bit or smaller, that can be avoided using intrinsics, by using 64-bit cvttss2si and truncating the result. If the input was not NaN, the min/max ensure that integer is unchanged by truncation. If the input was NaN, the integer is 0x8000000000000000 which truncates to 0.
  • I didn't include the cost of loading 2147483647.0 and -2147483648.0 into the registers, typically one mov from memory each.
  • For f32, 2147483647.0 cannot be represented exactly, so that doesn't actually work: you need another check. That makes things much worse. Ditto for f64 to u/i64, but f64 to u/i32 doesn't have this problem.

I suggest a compromise between the two approaches:

  • For f32/f64 to u/i16 and u/i8, and f64 to u/i32, go with min/max + truncation, as above, e.g.:
    let f = if f > 32767.0 { 32767.0 } else { f };
    let f = if f < -32768.0 { -32768.0 } else { f };
    cvttss2si(f) as i16

(For u/i16 and u/i8, the original conversion can be to i32; for f64 to u/i32, it needs to be to i64.)

  • For f32/64 to u32,
    let r = cvttss2si64(f) as u32;
    if f >= 4294967296.0 { 4294967295 } else { r }

is only a few instructions and no branches:

	cvttss2si	%xmm0, %rcx
	ucomiss	.LCPI0_0(%rip), %xmm0
	movl	$-1, %eax
	cmovbl	%ecx, %eax
  • For f32/64 to i64, maybe
    let r = cvttss2si64(f);
    if f >= 9223372036854775808. {
        9223372036854775807 
    } else if f != f {
        0
    } else {
        r
    }

This produces a longer (still branchless) sequence:

	cvttss2si	%xmm0, %rax
	xorl	%ecx, %ecx
	ucomiss	%xmm0, %xmm0
	cmovnpq	%rax, %rcx
	ucomiss	.LCPI0_0(%rip), %xmm0
	movabsq	$9223372036854775807, %rax
	cmovbq	%rcx, %rax

…but at least we save one comparison compared to the naive approach, as if f is too small, 0x8000000000000000 is already the correct answer (i.e. i64::MIN).

  • For f32 to i32, not sure whether it would be preferable to do the same as previous, or just convert to f64 first and then do the shorter min/max thing.

  • u64 is a mess I don't feel like thinking about. :p
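The "min/max + wider truncating convert" recipe for the small integer types can be sketched on x86-64 with the stable `std::arch` intrinsics (function name is mine, not from the thread):

```rust
// x86-64 sketch: f32 -> i16 via clamp + 64-bit truncating convert.
use std::arch::x86_64::{_mm_cvttss_si64, _mm_set_ss};

fn f32_to_i16_saturating(f: f32) -> i16 {
    // Clamp into i16 range. Both comparisons are false for NaN, so NaN
    // reaches cvttss2si unchanged and converts to i64::MIN, whose low
    // 16 bits are 0 -- exactly the desired "NaN -> 0" result.
    let f = if f > 32767.0 { 32767.0 } else { f };
    let f = if f < -32768.0 { -32768.0 } else { f };
    unsafe { _mm_cvttss_si64(_mm_set_ss(f)) as i16 }
}
```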

fst3a added a commit to fst3a/gltf that referenced this issue Nov 14, 2017

Fix potentially invalid casts (also current potential UB hole).
Currently, out-of-range float->int casts are UB (in safe Rust!), which is obviously not good; the current proposed fix is to make float->int casts saturate (see rust-lang/rust#10184). This fix adds a bunch of fail-safes for potentially out-of-range values. This code does not cover NaN, though; NaN-to-int will become 0 once the behaviour becomes well-defined.

matklad referenced this issue in neon-bindings/rfcs Nov 17, 2017

Open

WIP: RFC: ArrayBuffer Views #5

Contributor

rkruppe commented Nov 18, 2017

In https://internals.rust-lang.org/t/help-us-benchmark-saturating-float-casts/6231/14 someone reported a measurable and significant slowdown on JPEG encoding with the image crate. I've minimized the program so that it's self-contained and mostly focused on the parts that are related to the slowdown: https://gist.github.com/rkruppe/4e7972a209f74654ebd872eb4bc57722 (this program shows ~ 15% slowdown for me with saturating casts).

Note that the casts are f32->u8 (rgb_to_ycbcr) and f32->i32 (encode_rgb, "Quantization" loop) in equal proportions. It also looks like the inputs are all in range, i.e., saturation never actually kicks in, but in the case of the f32->u8 casts this can only be verified by calculating the minimum and maximum of a polynomial and accounting for rounding error, which is a lot to ask. The f32->i32 casts are more obviously in range for i32, but only because the elements of self.tables are nonzero, which is (apparently?) not that easy for the optimizer to show, especially in the original program. tl;dr: The saturation checks are there to stay; the only hope is making them faster.

I've also poked at the LLVM IR some -- it appears literally the only difference are the comparisons and selects from the saturating casts. A quick look indicates the asm has corresponding instructions and of course a bunch more live values (which lead to more spills).

@comex Do you think f32->u8 and f32->i32 casts can be made measurably faster with CVTTSS2SI?
