
Specialize .zip() for efficient slice and slice iteration #33090

Merged
merged 8 commits into rust-lang:master from bluss:special-zip-2 on Jun 17, 2016

Conversation

10 participants
@bluss (Contributor) commented Apr 19, 2016

The idea is to introduce a private trait TrustedRandomAccess and specialize .zip() for random access iterators into a counted loop.

The implementation in the PR is internal and has no visible effect on the API.

Why a counted loop? To have each slice iterator compile to just a pointer, and both pointers are indexed with the same loop counter value in the generated code. When this succeeds, copying loops are readily recognized and replaced with memcpy and addition loops autovectorize well.
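As a hand-written illustration of the counted-loop shape described above (this sketch is mine, not code from the PR): one trusted length is computed up front, and a single counter indexes both slices, so there is no per-element length or null check left for the optimizer to trip over.

```rust
// Illustrative only: the kind of loop the specialized zip aims to produce.
// One trusted length, one counter indexing both sides.
pub fn copy_counted(xs: &[u8], ys: &mut [u8]) {
    let len = std::cmp::min(xs.len(), ys.len());
    for i in 0..len {
        // SAFETY: i < len, and len is no larger than either slice's length.
        unsafe {
            *ys.get_unchecked_mut(i) = *xs.get_unchecked(i);
        }
    }
}

fn main() {
    let xs = [1u8, 2, 3, 4];
    let mut ys = [0u8; 4];
    copy_counted(&xs, &mut ys);
    assert_eq!(ys, xs);
}
```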

TrustedRandomAccess

The TrustedRandomAccess approach works very well on the surface. Microbenchmarks optimize well, following the ideas above, and that is a dramatic improvement of .zip()'s codegen.

```rust
// old zip before this PR: bad, byte-for-byte loop
// with specialized zip: memcpy
pub fn copy_zip(xs: &[u8], ys: &mut [u8]) {
    for (a, b) in ys.iter_mut().zip(xs) {
        *a = *b;
    }
}

// old zip before this PR: single addition per iteration
// with specialized zip: vectorized
pub fn add_zip(xs: &[f32], ys: &mut [f32]) {
    for (a, b) in ys.iter_mut().zip(xs) { *a += *b; }
}

// old zip before this PR: single addition per iteration
// with specialized zip: vectorized (!!)
pub fn add_zip3(xs: &[f32], ys: &[f32], zs: &mut [f32]) {
    for ((a, b), c) in zs.iter_mut().zip(xs).zip(ys) { *a += *b * *c; }
}
```

Non-null issues

Yet in more complex situations, the .zip() loop can still fall back to its old behavior, where phantom null checks introduce spurious early-exit conditionals. Remember that a NULL inside `Option<(&T, &T)>` makes it a `None` value and thus a premature (in this case) end of the loop.

So even if we have 1) an explicit `Some` in the code and 2) pointer types `&T` or `&mut T`, which are non-null, we can still get a phantom null check at that point.

One example that illustrates the difference is `copy_zip` with slice versus Vec arguments. The involved iterator types are exactly the same, but the Vec version doesn't compile down to memcpy. Investigating this, the function argument metadata emitted to LLVM plays the biggest role. As eddyb summarized, we need nonnull for the loop to autovectorize and noalias for it to be replaced with memcpy.

There was an experiment to use assume to add a non-null assumption on each of the two elements in the specialized zip iterator, but this only helped in some of the test cases and regressed others. Instead I think the nonnull/noalias metadata issue is something we need to solve separately anyway.

Adaptors

These have conditionally implemented TrustedRandomAccess

  • Enumerate
  • Zip

These have not implemented it

  • Map is side-effectful. The forward case would be workable, but the double-ended case is complicated.
  • Chain, exact length semantics unclear
  • Filter, FilterMap, FlatMap and many others don't offer random access and/or exact length
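To make the description above concrete, here is a hedged sketch of what a TrustedRandomAccess-style trait and the counted zip loop it enables could look like. The trait name follows the PR, but the signature, the toy slice iterator, and the free function are illustrative, not the libcore definitions.

```rust
// Sketch only: the name follows the PR description; the signature is
// illustrative, not libcore's exact definition.
trait TrustedRandomAccess: ExactSizeIterator {
    // Callers must guarantee i < self.len().
    unsafe fn get_unchecked(&mut self, i: usize) -> Self::Item;
}

// A toy slice iterator that can honor the contract.
struct SliceIter<'a, T> {
    data: &'a [T],
}

impl<'a, T> Iterator for SliceIter<'a, T> {
    type Item = &'a T;
    fn next(&mut self) -> Option<&'a T> {
        let (first, rest) = self.data.split_first()?;
        self.data = rest;
        Some(first)
    }
    fn size_hint(&self) -> (usize, Option<usize>) {
        (self.data.len(), Some(self.data.len()))
    }
}

impl<'a, T> ExactSizeIterator for SliceIter<'a, T> {}

impl<'a, T> TrustedRandomAccess for SliceIter<'a, T> {
    unsafe fn get_unchecked(&mut self, i: usize) -> &'a T {
        self.data.get_unchecked(i)
    }
}

// The counted loop a specialized Zip::next would drive: one counter
// indexes both iterators, with no per-element Option or null check.
fn zip_counted<A, B>(mut a: A, mut b: B) -> Vec<(A::Item, B::Item)>
where
    A: TrustedRandomAccess,
    B: TrustedRandomAccess,
{
    let len = std::cmp::min(a.len(), b.len());
    let mut out = Vec::with_capacity(len);
    for i in 0..len {
        // SAFETY: i < len, and len is no larger than either length.
        out.push(unsafe { (a.get_unchecked(i), b.get_unchecked(i)) });
    }
    out
}

fn main() {
    let xs = [1, 2, 3];
    let ys = [10, 20, 30, 40];
    let pairs = zip_counted(SliceIter { data: &xs }, SliceIter { data: &ys });
    assert_eq!(pairs, vec![(&1, &10), (&2, &20), (&3, &30)]);
}
```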
@rust-highfive (Collaborator) commented Apr 19, 2016

r? @aturon

(rust_highfive has picked a reviewer for you, use r? to override)

@bluss bluss changed the title WIP: Specialize .zip() for efficient slice+slice iteration WIP: Specialize .zip() for efficient slice and slice iteration Apr 19, 2016

@Stebalien (Contributor) commented Apr 19, 2016

Why TrustedRandomAccessIterator instead of TrustedExactSizeIterator with a next_unchecked method? Wouldn't that be more general (TrustedRandomAccessIterator implies TrustedExactSizeIterator, but not the other way around)? Would it not optimize as well?

@bluss (Contributor, Author) commented Apr 19, 2016

I haven't tried that. The counted loop motivation explains why I used this approach though.

@Stebalien (Contributor) commented Apr 19, 2016

I see. (Now to practice reading things from start to finish before asking questions...)

@alexcrichton (Member) commented Apr 19, 2016

Nice wins! The design here seems pretty clean, although I agree with @Stebalien that if we could get away with something like:

```rust
unsafe fn next_unchecked(&mut self) -> Self::Item;
```

Then that'd be pretty slick as well. (not sure if it optimizes the same way though).

It's a little unfortunate that the Zip struct is growing two words only used in some situations, but I wonder if we could specialize that as well? Something like:

```rust
struct Zip<A, B> {
    inner: <(A, B) as ZipImpl<A, B>>::Repr,
}

trait ZipImpl<A, B> {
    type Repr;
}

impl<A, B> ZipImpl<A, B> for (A, B) {
    type Repr = (A, B);
}

// ...
```

I'm not sure if that'll actually compile just yet, but perhaps worth a shot?

Also, could this add some benchmarks in-tree so eventually if we start tracking them we can track performance over time?

@bluss (Contributor, Author) commented Apr 19, 2016

@Stebalien @alexcrichton I have tried things similar to .next_unchecked, but not with the rigor or at the scale of this PR. In general, though, it does not optimize the same way the indexed version does and does not reduce to memcpy.

Specialization with associated types seems controversial as it is. I'm not worried about the extra struct fields: in the vast majority of use cases, they are locals in a function and can be optimized out if they are unused.

I will attempt to salvage Map by adding back the side-effect case for uneven-length iterators, and see if the performance of the test cases is still preserved.

I'll happily add the examples as benchmark cases, but for me they are more like codegen test cases. Do we have a facility for testing optimizations?

@alexcrichton (Member) commented Apr 19, 2016

But in general it does not optimize the same way the indexed version does and does not reduce to memcpy.

Ah yeah if LLVM can't connect the dots to realize everything uses the same iteration counter then seems fine to leave as is.

Specialization with associated types seems controversial as it is.

Could you elaborate a bit? Is it not implemented currently, or do you think it's likely to be removed? It's true that this would only be a minor win, I'm almost mostly just interested to see if it works :)

I'll happily add the examples as benchmark cases, but for me they are more like codegen test cases. Do we have a facility for testing optimizations?

We kinda do in src/test/codegen. You could do an operation and then assert that it's compiled to a memcpy or something like that, but unfortunately they're pretty brittle.

@alexcrichton alexcrichton added the T-libs label Apr 19, 2016

@bluss (Contributor, Author) commented Apr 21, 2016

Here's a playground snippet where you can compare the counted strategy with the next_unchecked strategy.

https://play.rust-lang.org/?gist=fbb8dece16c32195eb60219b9d769871&version=nightly&backtrace=0

The implementation is of course not so beautiful, since we need to access private fields of the iterators from outside their module.

next_unchecked vectorizes, but it is not recognized as memcpy, and you can compare the addition loops' code against each other too. The code generated for the addition loop by next_unchecked is generally a bit worse in benchmarks than the indexed approach.
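For comparison, here is a hedged sketch of the next_unchecked strategy from the playground link (a hypothetical trait of my own, not libcore's): the loop counts down a trusted length, but each side advances its own cursor instead of sharing one loop counter.

```rust
// Hypothetical alternative strategy: trusted length plus unchecked advance.
trait TrustedNextUnchecked: Iterator {
    fn trusted_len(&self) -> usize;
    // Callers must guarantee the iterator is non-empty.
    unsafe fn next_unchecked(&mut self) -> Self::Item;
}

impl<'a, T> TrustedNextUnchecked for std::slice::Iter<'a, T> {
    fn trusted_len(&self) -> usize {
        self.len()
    }
    unsafe fn next_unchecked(&mut self) -> &'a T {
        // SAFETY: the caller promised at least one element remains.
        unsafe { self.next().unwrap_unchecked() }
    }
}

// Zip via next_unchecked: each side advances independently, so the two
// accesses are not tied to a single index as in the counted version.
fn zip_unchecked<A, B>(mut a: A, mut b: B) -> Vec<(A::Item, B::Item)>
where
    A: TrustedNextUnchecked,
    B: TrustedNextUnchecked,
{
    let mut remaining = std::cmp::min(a.trusted_len(), b.trusted_len());
    let mut out = Vec::with_capacity(remaining);
    while remaining > 0 {
        remaining -= 1;
        // SAFETY: both iterators still hold at least remaining + 1 items.
        out.push(unsafe { (a.next_unchecked(), b.next_unchecked()) });
    }
    out
}

fn main() {
    let xs = [1, 2, 3, 4];
    let ys = [5, 6];
    let pairs = zip_unchecked(xs.iter(), ys.iter());
    assert_eq!(pairs, vec![(&1, &5), (&2, &6)]);
}
```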

@bluss bluss force-pushed the bluss:special-zip-2 branch from a70bf9e to 419848f Apr 21, 2016

@bluss (Contributor, Author) commented Apr 21, 2016

The AssumeNonNull code is gone, because it regressed some of the simplest testcases (even if it improved many others).

This optimization is pretty brittle itself! I've added benchmarks, and they optimize as they should, but only after I moved the code out into separate functions. The codegen test also passes.

I think Map will not participate in this specialization. I planned to reproduce the needed side effects, but then I realized the next_back case requires so much more that I don't know how it can be done.

@bluss (Contributor, Author) commented Apr 21, 2016

It's fragile and it depends on something that's outside the zip iterator's control.

Example: https://gist.github.com/bluss/5a088d3f420d12406689439e2940d731

Same input iterators but different sources (slice vs Vec).

@bluss bluss force-pushed the bluss:special-zip-2 branch 2 times, most recently from 90ff0a4 to 1346fa0 Apr 22, 2016

@bluss (Contributor, Author) commented Apr 22, 2016

Regarding the nonnull/noalias optimization failures: after looking through this yesterday, I think it's a very interesting issue, but it's orthogonal to and independent of this improvement to zip, even though the missing nonnull/noalias metadata readily destroys the optimization in many cases. A reliable workaround is to use an intermediate function with slice arguments (this works well for ndarray, for example). In the long run we must ensure that slices and slice iterators created directly from Vec variables or Vec function arguments are treated the same way as slice function arguments.
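A minimal sketch of the workaround mentioned (the function names here are my illustration): route the Vec data through an intermediate function whose arguments are slices, so the loop body sees slice arguments carrying the needed metadata.

```rust
// Illustration of the workaround: the slice-argument function is where the
// optimizer sees nonnull/noalias argument metadata.
fn add_slices(xs: &[f32], ys: &mut [f32]) {
    for (a, b) in ys.iter_mut().zip(xs) {
        *a += *b;
    }
}

// Vec-based caller: delegating to the slice-argument function restores the
// optimization-friendly shape.
fn add_vecs(xs: &Vec<f32>, ys: &mut Vec<f32>) {
    add_slices(xs, ys);
}

fn main() {
    let xs = vec![1.0f32, 2.0, 3.0];
    let mut ys = vec![10.0f32, 20.0, 30.0];
    add_vecs(&xs, &mut ys);
    assert_eq!(ys, vec![11.0, 22.0, 33.0]);
}
```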

@alexcrichton I did one attempt with associated types and found some interesting effects. The implementation diff on top of the current state of the PR is here https://gist.github.com/bluss/ee1396d6b845f8eefa04ead1fa940533 and the resulting compiler error is also in there. It looks like the compiler has to treat the default case as if it has a dropck obligation. This is all pretty new to me, and I don't know what's correct or how it can be worked around.

I think the approach using `Default` is the way to go without the lattice rule. There's no way the compiler can know the associated data is always `()` in the default implementation case.

@bluss bluss changed the title WIP: Specialize .zip() for efficient slice and slice iteration Specialize .zip() for efficient slice and slice iteration Apr 22, 2016

@aturon (Member) commented Apr 22, 2016

@alexcrichton

Could you elaborate a bit? Is it not implemented currently, or do you think it's likely to be removed? It's true that this would only be a minor win, I'm almost mostly just interested to see if it works :)

To clarify the overall status: it seems pretty clear that we're going to want specializable associated types in some form (and today's implementation has them). We're already starting to collect some pretty significant use cases. The main question is precisely who gets to assume what about these types -- there are a few ways we can take the design, and we want to get feedback on the use cases they enable.

@bluss (Contributor, Author) commented Apr 22, 2016

I found a use of associated types that compiles and works as before (optimizes the same). It's in the latest commit. It uses the trait bounds `'static + Default + Clone + Debug` on the associated type: `'static` was the ingredient needed to avoid the dropck issue, `Default` allows creating the field in the default impl, and `Clone` and `Debug` support the derives on Zip.

The trait owning the associated type is a separate trait implemented for every type, so that no new trait bounds are needed on the Zip struct.
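A hedged reconstruction of that shape (names and layout are my illustration; the real libcore impl marks the default case with `default` under the specialization feature so random-access iterators can override the associated type with their extra index/length state):

```rust
use std::fmt::Debug;

// Separate trait owning the associated type, implemented for every type,
// so the Zip struct itself needs no new bounds. Illustrative, not libcore.
trait ZipImplData {
    type Data: 'static + Default + Clone + Debug;
}

// Stable sketch of the default case only; in the PR this impl is `default`
// (specialization feature) and specialized impls choose a different Data.
impl<T: ?Sized> ZipImplData for T {
    type Data = ();
}

#[derive(Clone, Debug)]
struct MyZip<A, B> {
    a: A,
    b: B,
    extra: <(A, B) as ZipImplData>::Data,
}

impl<A, B> MyZip<A, B> {
    fn new(a: A, b: B) -> Self {
        // Default lets the constructor create the field without knowing
        // which impl chose the concrete Data type.
        MyZip { a, b, extra: Default::default() }
    }
}

fn main() {
    let z = MyZip::new(0..4, 0..3);
    assert_eq!(z.extra, ());
    assert_eq!((z.a.len(), z.b.len()), (4, 3));
}
```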

@alexcrichton (Member) commented Apr 22, 2016

Nice!

I'm definitely in favor of merging :) (but will defer to libs triage at least)

@bluss (Contributor, Author) commented Apr 22, 2016

It's in good shape for merging now. Avenues for expansion:

  • Could be expanded to a few more adaptors
  • The specialization trait could go pub unstable so that libcollections can implement it
  • Some brainstorming on how to bring in .map().
@bluss (Contributor, Author) commented Apr 22, 2016

The dropck thing is probably a bug, though.

@arielb1 (Contributor) commented Apr 23, 2016

it seems pretty clear that we're going to want specializable associated types in some form (and today's implementation has them).

This still feels like a scary type-system feature to me - and lo and behold, the current implementation has tons of bugs.

If you ignore the orphan rules suddenly becoming Very Important, method specialization is basically equivalent to matching on types - which you can basically do (unreliably) even before specialization with sufficient unsafe code tricks.

Specialized associated types, however, add that kind of matching to the type system itself. We don't even have associated types working in all cases even without specialization - I doubt that adding it will make the situation better.

@arielb1 (Contributor) commented Apr 23, 2016

@aturon

We still don't know how to do type equality with HRTB and associated types (at least, I don't, and rustc doesn't do it correctly). Selecting types based on that sounds... bogus.

@bstrie (Contributor) commented Apr 26, 2016

Is this related to specializing to take advantage of a trusted size_hint? (I didn't see any issues open for such a thing.)

@bluss (Contributor, Author) commented Apr 26, 2016

@bstrie It's not formally connected to it, but it could be. The specialization here has a much stricter demand (that you can access elements unchecked, by index).

@alexcrichton (Member) commented May 5, 2016

The libs team discussed this during triage today, and conclusion was positive all around. @aturon would like to scrutinize before landing, so I'll defer the final review to him.

@bors (Contributor) commented May 19, 2016

☔️ The latest upstream changes (presumably #33694) made this pull request unmergeable. Please resolve the merge conflicts.

@bluss bluss force-pushed the bluss:special-zip-2 branch from d4f9dfc to 5df05c6 Jun 14, 2016

@bluss (Contributor, Author) commented Jun 17, 2016

Thank you. Sorry that I was away for some weeks. Let's give this a go (I'm not 100% sure that the memcpy test will pass the buildbots).

@bors r=aturon

@bors (Contributor) commented Jun 17, 2016

📌 Commit 5df05c6 has been approved by aturon

@bors (Contributor) commented Jun 17, 2016

⌛️ Testing commit 5df05c6 with merge 0a35504...

bors added a commit that referenced this pull request Jun 17, 2016

Auto merge of #33090 - bluss:special-zip-2, r=aturon
Specialize .zip() for efficient slice and slice iteration

@alexcrichton (Member) commented Jun 17, 2016

@bors: retry force clean

  • restarted buildbot
@bors (Contributor) commented Jun 17, 2016

⌛️ Testing commit 5df05c6 with merge c8eff68...

bors added a commit that referenced this pull request Jun 17, 2016

Auto merge of #33090 - bluss:special-zip-2, r=aturon
Specialize .zip() for efficient slice and slice iteration


@bors bors merged commit 5df05c6 into rust-lang:master Jun 17, 2016

2 checks passed

continuous-integration/travis-ci/pr: The Travis CI build passed
homu: Test successful
@Stebalien (Contributor) commented Jul 6, 2016

Unless I'm mistaken, this should also have implementations for Skip and Take.

@frewsxcv (Member) commented Aug 16, 2016

Could this have made Zip invariant? #35727

@Stebalien (Contributor) commented Aug 17, 2016

Could this have made Zip invariant? #35727

It is.

@bluss bluss deleted the bluss:special-zip-2 branch Sep 15, 2016

@krdln (Contributor) commented Sep 16, 2016

@bluss Could you elaborate on why exactly implementing this for Map (and Cloned) would be a problem? I fail to see what kind of side effects would pose a problem; can you give an example? (Also, is this comment thread a proper place to ask such a question?)

@bluss (Contributor, Author) commented Sep 16, 2016

@krdln why not here, I don't mind.

I didn't know how to faithfully reproduce the side effects.

As an example, take two mapped iterators of lengths 4 and 3: `(0..4).map(f).zip((0..3).map(g))`.

The quirky part is that f will be called four times, and g three times. The fourth time is to extract the last element from the first iterator, before we realize there is no fourth element in the second iterator.

The specialized version computes a length 3 and only attempts to extract 3 elements from each iterator, as it is implemented now.

I remember I tried to replicate this in the specialized case. Maybe we can try again.

@krdln (Contributor) commented Sep 16, 2016

Oh, I totally didn't think about this edge case. So, by implementing this "naively" (by that I mean without replicating the original behaviour), you would introduce an inconsistency between the non-specialized and specialized versions. So my question now is why exactly that's a problem.

Is it because of specialized version having different semantics? If that's really needed for correctness, then in fact the naive implementation is no-go.

Or is it because it may potentially break somebody's code that relies on one more call to the closure? I would be really surprised if somebody was relying on this for correctness, especially as this behaviour is not documented anywhere. I would consider that change about as breaking as changing the order of comparisons in a sort implementation.

Anyway, I mostly asked the question because of the Cloned adapter. I was sad that my code wasn't autovectorizing and had started to refactor it to unsafe manual indexing, when I noticed that removing .cloned() made it vectorize! So, would it at least be a good idea to implement TrustedRandomAccess for Cloned where T: Copy? The docs say

More formally: if T: Copy, x: T, and y: &T, then let x = y.clone(); is equivalent to let x = *y;. Manual implementations should be careful to uphold this invariant; however, unsafe code must not rely on it to ensure memory safety.

so I guess it would be fine?
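The quoted invariant can be exercised directly (a trivial check of my own, not code from the PR):

```rust
// For T: Copy, the documented invariant says (*y).clone() and *y agree.
fn clone_equals_copy<T: Copy + PartialEq>(y: &T) -> bool {
    (*y).clone() == *y
}

fn main() {
    assert!(clone_equals_copy(&7u32));
    assert!(clone_equals_copy(&3.5f64));
    // And iterating through .cloned() yields the same values as copying.
    let xs = [1u8, 2, 3];
    let a: Vec<u8> = xs.iter().cloned().collect();
    let b: Vec<u8> = xs.iter().copied().collect();
    assert_eq!(a, b);
}
```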

@bluss (Contributor, Author) commented Sep 16, 2016

Cloned is much simpler than Map, I don't see much problem with saying that .clone() might not be called. Even simpler if the type is Copy.

I think that when we specialize we need to pay very close attention to the details and always try to make it seamless. Otherwise, code that builds on top of, for example, zip suddenly gets very hard to understand.

Manishearth added a commit to Manishearth/rust that referenced this pull request Sep 17, 2016

Rollup merge of rust-lang#36490 - bluss:zip-slightly-despecialized-edition, r=alexcrichton

Remove data structure specialization for .zip() iterator

Go back on half the specialization, the part that changed the Zip
struct's fields themselves depending on the types of the iterators.

Previous PR: rust-lang#33090

This means that the Zip iterator will always carry two usize fields,
which are sometimes unused. If a whole for loop using a .zip() iterator is
inlined, these are simply removed and have no effect.

The same improvement for Zip of for example slice iterators remain, and
they still optimize well. However, like when the specialization of zip
was merged, the compiler is still very sensitive to the exact context.

For example this code only autovectorizes if the function is used, not
if the code in zip_sum_i32 is inserted inline where it was called:

```rust
fn zip_sum_i32(xs: &[i32], ys: &[i32]) -> i32 {
    let mut s = 0;
    for (&x, &y) in xs.iter().zip(ys) {
        s += x * y;
    }
    s
}

fn zipdot_i32_default_zip(b: &mut test::Bencher)
{
    let xs = vec![1; 1024];
    let ys = vec![1; 1024];

    b.iter(|| {
        zip_sum_i32(&xs, &ys)
    })
}
```

Include a test that checks that `Zip<T, U>` is covariant w.r.t. T and U.

Fixes rust-lang#35727

bors added a commit that referenced this pull request Sep 17, 2016

Auto merge of #36490 - bluss:zip-slightly-despecialized-edition, r=alexcrichton
