Join GitHub today
GitHub is home to over 28 million developers working together to host and review code, manage projects, and build software together.
Sign upBig performance problem with closed intervals looping #45222
Comments
This comment has been minimized.
This comment has been minimized.
|
Performing the low-hanging fruit for minimization: #![feature(inclusive_range_syntax)]
#![allow(private_no_mangle_fns)]
#[inline(never)]
#[no_mangle]
fn triangle_exc(n: u64) -> u64 {
let mut count = 0;
for j in (0 .. n + 1) {
count += j;
}
count
}
#[inline(never)]
#[no_mangle]
fn triangle_inc(n: u64) -> u64 {
let mut count = 0;
for j in 0 ..= n {
count += j;
}
count
}
fn main() {
let n: u64 = std::env::args().nth(1).unwrap().parse().unwrap();
println!("{}", triangle_exc(n));
println!("{}", triangle_inc(n));
}Good:
Bad:
Lost some kind of fast lane? |
This comment has been minimized.
This comment has been minimized.
|
Apparently llvm is way happier optimizing the open interval with simd . |
TimNN
added
C-enhancement
I-slow
labels
Oct 17, 2017
leonardo-m
referenced this issue
Nov 9, 2017
Open
Do not suggest to use the closed interval syntax #2217
kennytm
added
the
T-compiler
label
Jan 28, 2018
This comment has been minimized.
This comment has been minimized.
|
Referring to the example in #45222 (comment), the first code is recognized by LLVM loop-vectorize, while the second doesn't.
After vectorization the If we black-box the
We may see if upgrading to LLVM 6 can help this case. Also note that the two pieces of code are not equivalent: the |
This comment has been minimized.
This comment has been minimized.
|
I've checked again using the LLVM 6 build. The performance is improved, but there is still a large gap (2.6× → 2.0×). Vectorization is still not recognized for Timings$ rustc +nightly -C codegen-units=1 -C opt-level=3 -C target-cpu=native 1.rs
$ time ./1 30000 2
13500450000000
real 0m5.389s
user 0m5.136s
sys 0m0.012s
$ time ./1 30000 1
13500450000000
real 0m2.195s
user 0m2.192s
sys 0m0.000s
$ rustc +38bd38147d2fa21f8a684b019fc0763adf8fd436 -C codegen-units=1 -C opt-level=3 -C target-cpu=native 1.rs
$ time ./1 30000 2
13500450000000
real 0m4.445s
user 0m4.332s
sys 0m0.000s
$ time ./1 30000 1
13500450000000
real 0m2.330s
user 0m2.184s
sys 0m0.016s |
This comment has been minimized.
This comment has been minimized.
|
This might just be the classic external-iteration-is-slower-sometimes problem. Note that even Fix for internal iteration is up at #48012 |
scottmcm
referenced this issue
Feb 5, 2018
Open
Range* should overrride more methods of Iterator #39975
This comment has been minimized.
This comment has been minimized.
|
I believe this can be fixed by adding an extra field to struct FixedRangeInclusive {
start: u64,
end: u64,
done: bool,
}
fn fixed_range_inclusive(start: u64, end: u64) -> FixedRangeInclusive {
FixedRangeInclusive {
start,
end,
done: false,
}
}
impl Iterator for FixedRangeInclusive {
type Item = u64;
fn next(&mut self) -> Option<Self::Item> {
if !self.done {
if self.start == self.end {
self.done = true;
}
let new = self.start.wrapping_add(1);
Some(std::mem::replace(&mut self.start, new))
} else {
None
}
}
}Check out the assembly on the playground. |
This comment has been minimized.
This comment has been minimized.
|
The current two-field RangeInclusive follows from rust-lang/rfcs#1980. The |
This comment has been minimized.
This comment has been minimized.
|
I'm aware that |
This comment has been minimized.
This comment has been minimized.
This would be jarring, in consideration of the fact that It is unfortunate. Were this not the case I could almost picture something like this: pub struct RangeInclusive<T> {
// NOTE: not pub
start: T, // actually, these should probably be ManuallyDrop<T>
end: T, // or union MaybeUninit<T> { value: T, empty: () }
done: bool,
}
impl RangeInclusive<T> {
// Expose an API that matches the functionality of the enum type
#[inline] pub fn new(start: T, end: T) -> Self { ... }
#[inline] pub fn new_done() -> Self { ... }
#[inline] pub fn endpoints(&self) -> Option<(&T, &T)> { ... }
#[inline] pub fn endpoints_mut(&mut self) -> Option<(&mut T, &mut T)> { ... }
#[inline] pub fn into_endpoints(self) -> Option<(T, T)> { ... }
}and ISTM (note: haven't tested) that this should optimize just as well as the three-field struct, since it IS the three-field struct (just with statically enforced usage patterns). But I suppose that, even then, it would seem questionable to have a standard library type that simulates a enum (rather than being one) solely for performance concerns. Edit: I misread somewhat and thought that the enum was the current proposal. |
This comment has been minimized.
This comment has been minimized.
|
@leonardo-m Can you share some non-simple examples where the current form is a problem? #48012 will make it so that the example in here is fine if written the easier way For simple things like sums of iota, it's easy to get suboptimal codegen from all kinds of different iterators. Like using Edit: #48057 has also improved things in recent nightlies, though it's still not perfect. |
kennytm
referenced this issue
Feb 6, 2018
Closed
Tracking issue for `..=` inclusive ranges (RFC #1192) -- originally `...` #28237
added a commit
to kennytm/rust
that referenced
this issue
Feb 6, 2018
added a commit
that referenced
this issue
Feb 8, 2018
added a commit
that referenced
this issue
Feb 8, 2018
This was referenced Apr 14, 2018
This comment has been minimized.
This comment has been minimized.
|
For record: Enabling Polly (#50044) doesn't fix the issue. |
This comment has been minimized.
This comment has been minimized.
Stargateur
commented
May 11, 2018
•
|
@ollie27 Just a note, if you are going to use a impl Iterator for FixedRangeInclusive {
type Item = u64;
fn next(&mut self) -> Option<Self::Item> {
if self.start < self.end {
let new = self.start.wrapping_add(1);
Some(std::mem::replace(&mut self.start, new))
} else if !self.done {
self.done = true;
Some(self.start)
} else {
None
}
}
} |
added a commit
to kennytm/rust
that referenced
this issue
Jun 18, 2018
kennytm
referenced this issue
Jun 18, 2018
Merged
Change RangeInclusive to a three-field struct. #51622
added a commit
to kennytm/rust
that referenced
this issue
Jun 19, 2018
added a commit
to kennytm/rust
that referenced
this issue
Jun 19, 2018
added a commit
to kennytm/rust
that referenced
this issue
Jun 22, 2018
added a commit
to kennytm/rust
that referenced
this issue
Jun 30, 2018
added a commit
to kennytm/rust
that referenced
this issue
Jul 12, 2018
added a commit
that referenced
this issue
Jul 12, 2018
added a commit
to kennytm/rust
that referenced
this issue
Jul 13, 2018
added a commit
that referenced
this issue
Jul 13, 2018
added a commit
that referenced
this issue
Jul 13, 2018
bors
closed this
in
#51622
Jul 13, 2018
kennytm
referenced this issue
Dec 5, 2018
Merged
Replace usages of `..i + 1` ranges with `..=i`. #56516
This comment has been minimized.
This comment has been minimized.
leonardo-m
commented
Dec 6, 2018
|
To answer Issue #56516, this is a first example of the performance problem, better examples could follow. Code example from:
If I replace the open intervals with closed ones:
For the second version I am seeing a run-time of about 6 seconds. |
kennytm
reopened this
Dec 6, 2018
sinkuu
referenced this issue
Dec 6, 2018
Closed
Override <RangeInclusive as Iterator>::try_(r)fold #56563
This comment has been minimized.
This comment has been minimized.
leonardo-m
commented
Jan 1, 2019
|
Lot of discussion here: https://old.reddit.com/r/rust/comments/ab7hsi/comparing_pythagorean_triples_in_c_d_and_rust/ |
leonardo-m commentedOct 11, 2017
•
edited by kennytm
In my code I've essentially stopped using loops with ... (intervals closed on the right, recently written with the syntax ..=) because they give performance problems. This is a simple example that shows the problem:
Compiled with the last Nightly:
Compiled with:
rustc -O test.rsRunning it calling foo1 takes 0.02 seconds:
Running it calling foo2 takes about 13.65 seconds:
The asm I am seeing using "--emit asm"