Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Use sqrt intrinsic for floating point sqrt #5686

Closed
brson opened this issue Apr 3, 2013 · 12 comments
Closed

Use sqrt intrinsic for floating point sqrt #5686

brson opened this issue Apr 3, 2013 · 12 comments
Labels
E-easy Call for participation: Easy difficulty. Experience needed to fix: Not much. Good first issue. I-slow Issue: Problems and improvements with respect to performance of generated code.

Comments

@brson
Copy link
Contributor

brson commented Apr 3, 2013

sqrt is still using libc

@huonw
Copy link
Member

huonw commented Apr 3, 2013

I'm running tests on a patch that uses intrinsics for sqrt, sin, cos, exp, etc.

@thestinger
Copy link
Contributor

@dbaupp: Nice! It would be awesome to have all of the useful LLVM intrinsics for integers/floats exposed.

@huonw
Copy link
Member

huonw commented Apr 3, 2013

Yay! 5x faster:

use core::float::sqrt;

fn main() {
    let mut sum: float = 0.0;
    for core::uint::range(0, 100000000) |i| {
        sum += sqrt(i as float);
    }
    io::println(fmt!("%f", sum));
}
# using cmath
$ rustc --opt-level=3 intrinsics-bench.rs  && time ./intrinsics-bench 
666666661666.567017

real    0m2.729s
user    0m2.712s
sys     0m0.008s

# using intrinsics
$ x86_64-unknown-linux-gnu/stage2/bin/rustc --opt-level=3 intrinsics-bench.rs  && time ./intrinsics-bench 
666666661666.567017

real    0m0.513s
user    0m0.500s
sys     0m0.008s

@brson
Copy link
Contributor Author

brson commented Apr 3, 2013

@dbaupp mentioned on reddit that using some of these clobbers the stack canary. I hadn't considered that before and we need to think about how these are implemented. How many of the intrinsics that we are already using are just trotting all over the stack red zone? Perhaps LLVM's intrinsics don't emit the proper stack growth prologues.

@huonw
Copy link
Member

huonw commented Apr 3, 2013

(The very brief reddit thread, for future reference.)

Do I continue working on this? (i.e. open a pull request, since I've got the code basically organised.)

The (most) dangerous intrinsics seem to be sin, cos and exp, so I've left sin as libc and added sin_intr for the intrinsic etc.

@brson
Copy link
Contributor Author

brson commented Apr 3, 2013

@dbaupp I'd like to know what assembly they generate. Do you have a test case that demonstrates all of them?

huonw added a commit to huonw/rust that referenced this issue Apr 3, 2013
Achieves up to 5x speed up! However, the intrinsics seem to do bad
things to the stack, especially sin, cos and exp (rust-lang#5686 has
discussion).

Also, add f{32,64,loat}::powi, and reorganise the delegation code so
that functions have the #[inline(always)] annotation, and reduce the
repetition of delegate!(..).
@huonw
Copy link
Member

huonw commented Apr 3, 2013

@brson: #5697

@pcwalton
Copy link
Contributor

@huonw The necessary LLVM changes have been made.

However, I didn't realize that intrinsics can clobber the stack. I guess we need to write wrappers with the fixedstacksegment attribute for the problematic intrinsics (sin, cos) as well! This can be done entirely in the Rust compiler now, no need to patch LLVM.

@huonw
Copy link
Member

huonw commented Apr 20, 2013

@pcwalton Does fixed_stack_segment work on extern/intrinsic functions? i.e.

#[abi = "rust-intrinsic"]
pub extern "rust-intrinsic" {
   #[fixed_stack_segment]
   pub fn sinf32(x: f32) -> f32;
}

@pcwalton
Copy link
Contributor

@huonw Extern functions automatically require fixed stack segments, if the #[fast_ffi] attribute is used. (Eventually this attribute will no longer be necessary.) However, I don't think that it will work for functions with the rust-intrinsic ABI, because I didn't realize that some intrinsic functions would need it.

bors added a commit that referenced this issue Apr 20, 2013
…lton

This implements the fixed_stack_segment for items with the rust-intrinsic abi, and then uses it to make f32 and f64 use intrinsics where appropriate, but without overflowing stacks and killing canaries (cf. #5686 and #5697). Hopefully.

@pcwalton, the fixed_stack_segment implementation involved mirroring its implementation in `base.rs` in `trans_closure`, but without adding the `set_no_inline` (reasoning: that would defeat the purpose of intrinsics), which is possibly incorrect.

I'm a little hazy about how the underlying structure works, so I've annotated the 4 that have caused problems so far, but there's no guarantee that the other intrinsics are entirely well-behaved.

Anyway, it has good results (the following are just summing the result of each function for 1 up to 100 million):

```
$ ./intrinsics-perf.sh f32
func   new   old   speedup
sin    0.80  2.75  3.44
cos    0.80  2.76  3.45
sqrt   0.56  2.73  4.88
ln     1.01  2.94  2.91
log10  0.97  2.90  2.99
log2   1.01  2.95  2.92
exp    0.90  2.85  3.17
exp2   0.92  2.87  3.12
pow    6.95  8.57  1.23

   geometric mean: 2.97

$ ./intrinsics-perf.sh f64
func   new   old   speedup
sin    12.08  14.06  1.16
cos    12.04  13.67  1.14
sqrt   0.49  2.73  5.57
ln     4.11  5.59  1.36
log10  5.09  6.54  1.28
log2   2.78  5.10  1.83
exp    2.00  3.97  1.99
exp2   1.71  3.71  2.17
pow    5.90  7.51  1.27

   geometric mean: 1.72
```

So about 3x faster on average for f32, and 1.7x for f64. This isn't exactly apples to apples though, since this patch also adds #[inline(always)] to all the function definitions too, which possibly gives a speedup.

(fwiw, GitHub is showing 93c0888 after d9c54f8 (since I cherry-picked the latter from #5697), but git's order is the other way.)
@huonw
Copy link
Member

huonw commented Apr 21, 2013

@brson #5975 fixed this.

@brson
Copy link
Contributor Author

brson commented Apr 21, 2013

Thanks @huonw!

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
E-easy Call for participation: Easy difficulty. Experience needed to fix: Not much. Good first issue. I-slow Issue: Problems and improvements with respect to performance of generated code.
Projects
None yet
Development

No branches or pull requests

4 participants