
Built-in u128 performance compared to extprim crate #39078

Open · leonardo-m opened this Issue Jan 15, 2017 · 6 comments

leonardo-m commented Jan 15, 2017

This is a performance bug report.

This little program solves a problem derived from this one (a ten-times-larger instance):
https://projecteuler.net/problem=55

The solution (this is not the fastest solution for this problem, but it's an easy-to-understand one):

```rust
#![feature(i128_type)] // u128 was a nightly-only feature at the time (Jan 2017)

fn is_lychrel(mut n: u128) -> bool {
    for _ in 0..50 {
        n += n.to_string().chars().rev().collect::<String>().parse().unwrap();
        if n.to_string() == n.to_string().chars().rev().collect::<String>() {
            return false;
        }
    }
    true
}

fn e55() -> usize {
    (0u32..100_000).filter(|&i| is_lychrel(i.into())).count()
}

fn main() {
    println!("{}", e55() == 6208);
}
```

On my PC it runs in about 1.12 seconds.
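The absolute timings quoted in this thread are machine-dependent. As a sketch, the same kind of wall-clock measurement can be reproduced with `std::time::Instant`; the range here is reduced to 10,000 (the original Project Euler 55 size) so it finishes quickly, whereas the figures in this issue use the full 100,000 range:

```rust
use std::time::Instant;

fn is_lychrel(mut n: u128) -> bool {
    for _ in 0..50 {
        n += n.to_string().chars().rev().collect::<String>().parse::<u128>().unwrap();
        if n.to_string() == n.to_string().chars().rev().collect::<String>() {
            return false;
        }
    }
    true
}

fn main() {
    let start = Instant::now();
    // Reduced range so this smoke run finishes fast; the timings quoted
    // in this issue were measured over 0..100_000.
    let count = (0u32..10_000).filter(|&i| is_lychrel(i.into())).count();
    println!("count = {}, elapsed = {:?}", count, start.elapsed());
}
```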

Using an external crate for the 128-bit values, with the same code (extprim = "1.1.0"):

```rust
extern crate extprim;
use extprim::u128::u128;

fn is_lychrel(mut n: u128) -> bool {
    for _ in 0..50 {
        n += n.to_string().chars().rev().collect::<String>().parse().unwrap();
        if n.to_string() == n.to_string().chars().rev().collect::<String>() {
            return false;
        }
    }
    true
}

fn e55() -> usize {
    (0u32..100_000).filter(|&i| is_lychrel(i.into())).count()
}

fn main() {
    println!("{}", e55() == 6208);
}
```

This runs in about 0.72 seconds.

est31 (Contributor) commented Jan 15, 2017

On which target do you run this, and which one was it compiled for? Also, what does the to_string implementation look like? Is it different between the two examples?


leonardo-m (Author) commented Jan 15, 2017

Right. I am using a 64-bit Windows GNU target.

leonardo-m (Author) commented Jan 15, 2017

I think this is the formatter for that crate:

```rust
impl fmt::Display for u128 {
    fn fmt(&self, formatter: &mut fmt::Formatter) -> fmt::Result {
        if self.hi == 0 {
            self.lo.fmt(formatter)
        } else {
            const TEN19: u128 = u128 { lo: 10000000000000000000, hi: 0 };

            let mut buffer = [0u8; 39];
            let mut buf = FormatBuffer::new(&mut buffer);

            let (mid, lo) = div_rem(*self, TEN19);
            if mid.hi == 0 {
                try!(write!(&mut buf, "{}{:019}", mid.lo, lo.lo));
            } else {
                let (hi, mid) = div_rem(mid, TEN19);
                try!(write!(&mut buf, "{}{:019}{:019}", hi.lo, mid.lo, lo.lo));
            }

            formatter.pad_integral(true, "", unsafe { buf.into_str() })
        }
    }
}
```

Here div_rem uses a udivmodti4 function from:

https://github.com/kennytm/extprim/blob/master/src/compiler_rt.rs
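The idea behind that Display impl is to split the value into base-10^19 limbs, so each limb fits in a u64 and at most two 128-bit divisions are needed per value; the rest is cheap 64-bit formatting. A minimal sketch of the same chunking, expressed with the built-in u128 (the helper name is hypothetical, not part of either crate):

```rust
// Split a u128 into up to three base-10^19 chunks; 10^19 is the largest
// power of ten that fits in a u64, so every chunk formats as a plain u64.
fn format_u128_chunked(n: u128) -> String {
    const TEN19: u128 = 10_000_000_000_000_000_000; // 10^19

    if n < TEN19 {
        return (n as u64).to_string(); // fast path: pure 64-bit formatting
    }
    let (mid, lo) = (n / TEN19, n % TEN19);
    if mid < TEN19 {
        // Two chunks: the low one is zero-padded to 19 digits.
        format!("{}{:019}", mid as u64, lo as u64)
    } else {
        // Three chunks: u128::MAX is about 3.4 * 10^38, so hi <= 34.
        let (hi, mid) = (mid / TEN19, mid % TEN19);
        format!("{}{:019}{:019}", hi as u64, mid as u64, lo as u64)
    }
}
```

The zero-padding (`{:019}`) is what makes simple concatenation of the limbs correct: each lower limb always contributes exactly 19 decimal digits.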

nagisa (Contributor) commented Jan 15, 2017

There are at most two places the difference in runtime could come from:

  1. the implementation of the intrinsic function(s);
  2. overhead due to ABI differences (unlikely to be a 2x penalty);

or some combination of the two. While the linked udivmodti4 intrinsic seems fairly similar, one can also spot a number of potentially relevant differences (including signature violations).

perf also shows that about 55% of the runtime is spent on a few particularly hot instructions on my machine. Notably, there's a shr that's as hot as the hottest divisions, which suggests some pathological case. Some gentle massaging of the intrinsic could improve the performance of such code immensely.

This issue applies to the intrinsics in general; all of them could benefit from having some of their low-hanging fruit plucked.
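For context on why a shr can be as hot as the divisions: the generic slow path of a udivmodti4-style helper is textbook shift-and-subtract long division, which shifts one bit per iteration for up to 128 iterations. An illustrative sketch under that assumption (this is not the actual compiler-rt code, which has faster special-case paths):

```rust
// Restoring binary long division for 128-bit operands: one shift per bit,
// which is why per-bit shifts dominate profiles of the generic path.
fn udivmod_binary(n: u128, d: u128) -> (u128, u128) {
    assert!(d != 0, "division by zero");
    let (mut q, mut r) = (0u128, 0u128);
    for i in (0..128).rev() {
        let carry = r >> 127;            // bit about to be shifted out of r
        r = (r << 1) | ((n >> i) & 1);   // bring down the next bit of n
        // The true remainder is carry * 2^128 + r; it always exceeds d when
        // carry is set, because d itself fits in 128 bits.
        if carry != 0 || r >= d {
            r = r.wrapping_sub(d);       // correct modulo 2^128 in both cases
            q |= 1u128 << i;             // record a one bit in the quotient
        }
    }
    (q, r)
}
```

Real implementations short-circuit this loop when the divisor fits in 64 bits or when operands share leading zeros, which is exactly the kind of low-hanging fruit mentioned above.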

est31 (Contributor) commented Jan 15, 2017

There is at least one place where a speed optimisation is possible: https://github.com/rust-lang/rust/blob/master/src/libcompiler_builtins/lib.rs#L233

Now that we have a benchmark, we can actually compare. However, any change should also be applied to the libcompiler_builtins crate, as otherwise the work will be wasted once that crate gets merged.
