Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

flat_map fails to autovectorize internal iterators #104914

Open
james7132 opened this issue Nov 25, 2022 · 8 comments
Open

flat_map fails to autovectorize internal iterators #104914

james7132 opened this issue Nov 25, 2022 · 8 comments
Labels
A-codegen Area: Code generation C-bug Category: This is a bug. I-slow Issue: Problems and improvements with respect to performance of generated code.

Comments

@james7132
Copy link

james7132 commented Nov 25, 2022

#[no_mangle]
fn iter(query: &mut [&mut [f32]]) {
    for segment in query {
        for value in segment.iter_mut() {
            *value += 1.0;
        }
    }
}

#[no_mangle]
fn flat_map(query: &mut [&mut [f32]]) {
    let iter = query.iter_mut().flat_map(|segment| segment.iter_mut());
    for value in iter {
        *value += 1.0;
    }
}

I expected to see this happen: Both versions of the function to autovectorize.

Instead, this happened: only the normal for-loop version autovectorizes. See https://godbolt.org/z/caTqoq3x3.

Meta

rustc --version --verbose:

rustc 1.65.0 (897e37553 2022-11-02)
binary: rustc
commit-hash: 897e37553bba8b42751c67658967889d11ecd120
commit-date: 2022-11-02
host: x86_64-pc-windows-msvc
release: 1.65.0
LLVM version: 15.0.0

Also happens on nightly and for other "flattened" iterators. For example, Bevy's QueryIter and Query::for_each show this exact API split due to this performance difference.

There seems to be #87411 that was targeting something similar.

@james7132 james7132 added the C-bug Category: This is a bug. label Nov 25, 2022
@the8472
Copy link
Member

the8472 commented Nov 25, 2022

#[no_mangle]
fn flat_map(query: &mut [&mut [f32]]) {
    let iter = query.iter_mut().flat_map(|segment| segment.iter_mut());
    for value in iter {
        *value += 1.0;
    }
}

That's still external iteration. If you want internal iteration use

fn flat_map(query: &mut [&mut [f32]]) {
    query.iter_mut().flat_map(AsMut::as_mut).for_each(|value| *value += 1.0);
}

which does vectorize.

@james7132
Copy link
Author

Is there a way for us to get it to vectorize without internal iteration? The aforementioned case in Bevy cannot be separated like flat_map, and we'd rather encourage use of external iteration as it's generally more idiomatic.

@the8472
Copy link
Member

the8472 commented Nov 26, 2022

Well, for_each was more or less introduced for that purpose. The issue is the inner loop in Flatten's next() which makes it hard to optimize external iteration. There are some LLVM passes that are disabled even on O3 and the Polly plugin that can do some additional loop transformations but I assume there's a reason why they're off.

Other than that you could see if you can structure your data to return arrays or an iterator-of-iterators so users could write their own nested loops.

@james7132
Copy link
Author

james7132 commented Nov 26, 2022

The issue is the inner loop in Flatten's next() which makes it hard to optimize external iteration.

The implementation we have looks very similar.

Other than that you could see if you can structure your data to return arrays or an iterator-of-iterators so users could write their own nested loops.

This unfortunately is a 100% deal breaker since that fundamentally breaks the UX we're aiming to expose, and potentially forces unsafe into userspace code due to the underlying storage being type erased. We do have an internal iteration version, as mentioned before, and it does properly vectorize. However, its use is heavily discouraged right now as it's not extensible nor idiomatic. We were hoping to close this perf gap and remove the internal iteration option.

@Rageking8
Copy link
Contributor

@rustbot label +A-codegen +I-slow

@rustbot rustbot added A-codegen Area: Code generation I-slow Issue: Problems and improvements with respect to performance of generated code. labels Nov 26, 2022
@the8472
Copy link
Member

the8472 commented Nov 26, 2022

Query::for_each

This isn't an iterator method. You'll want to override the default try_fold and fold impls on your Iterator, similar to what various iterators in core do. That way your users will be able to use iterators and benefit from internal iteration when they call iterator methods that use internal iteration. That'll compose and provide better performance.

@james7132
Copy link
Author

This isn't an iterator method.

The intent was largely for UX (.iter().for_each() takes more to write out and screws with formatting) while also providing improved performance via internal iteration.

You'll want to override the default try_fold and fold impls on your Iterator, similar to what various iterators in core do.

Is it possible to implement those functions in stable Rust right now? I'm getting an error saying that try_trait_v2 is unstable.


Again ideally we wouldn't need this and the external iteration option optimizes to the same result. This isn't what I would call a zero-cost abstraction if users need to make a conscious choice to choose one over the other purely for performance purposes.

@the8472
Copy link
Member

the8472 commented Nov 26, 2022

Well, at least fold should be possible to implement on stable rust, which will be used by for_each and some other methods.

Again ideally we wouldn't need this and the external iteration option optimizes to the same result. This isn't what I would call a zero-cost abstraction if users need to make a conscious choice to choose one over the other purely for performance purposes.

Yes, that's a long-known issue (see the blog post linked previously) but there aren't any good plans how to fix it.

github-merge-queue bot pushed a commit to bevyengine/bevy that referenced this issue Nov 28, 2023
… Iterator combinators (#6773)

# Objective
After #6547, `Query::for_each` has been capable of automatic
vectorization on certain queries, which is seeing a notable (>50% CPU
time improvements) for iteration. However, `Query::for_each` isn't
idiomatic Rust, and lacks the flexibility of iterator combinators.

Ideally, `Query::iter` and friends should be able to achieve the same
results. However, this does seem to blocked upstream
(rust-lang/rust#104914) by Rust's loop optimizations.

## Solution
This is an intermediate solution and refactor. This moves the
`Query::for_each` implementation onto the `Iterator::fold`
implementation for `QueryIter` instead. This should result in the same
automatic vectorization optimization on all `Iterator` functions that
internally use fold, including `Iterator::for_each`, `Iterator::count`,
etc.

With this, it should close the gap between the two completely.
Internally, this PR changes `Query::for_each` to use
`query.iter().for_each(..)` instead of the duplicated implementation.

Separately, the duplicate implementations of internal iteration (i.e.
`Query::par_for_each`) now use portions of the current `Query::for_each`
implementation factored out into their own functions.

This also massively cleans up our internal fragmentation of internal
iteration options, deduplicating the iteration code used in `for_each`
and `par_iter().for_each()`.

---

## Changelog
Changed: `Query::for_each`, `Query::for_each_mut`, `Query::for_each`,
and `Query::for_each_mut` have been moved to `QueryIter`'s
`Iterator::for_each` implementation, and still retains their performance
improvements over normal iteration. These APIs are deprecated in 0.12
and will be removed in 0.13.

---------

Co-authored-by: JoJoJet <21144246+JoJoJet@users.noreply.github.com>
github-merge-queue bot pushed a commit to bevyengine/bevy that referenced this issue Dec 1, 2023
… Iterator combinators (#6773)

# Objective
After #6547, `Query::for_each` has been capable of automatic
vectorization on certain queries, which is seeing a notable (>50% CPU
time improvements) for iteration. However, `Query::for_each` isn't
idiomatic Rust, and lacks the flexibility of iterator combinators.

Ideally, `Query::iter` and friends should be able to achieve the same
results. However, this does seem to blocked upstream
(rust-lang/rust#104914) by Rust's loop optimizations.

## Solution
This is an intermediate solution and refactor. This moves the
`Query::for_each` implementation onto the `Iterator::fold`
implementation for `QueryIter` instead. This should result in the same
automatic vectorization optimization on all `Iterator` functions that
internally use fold, including `Iterator::for_each`, `Iterator::count`,
etc.

With this, it should close the gap between the two completely.
Internally, this PR changes `Query::for_each` to use
`query.iter().for_each(..)` instead of the duplicated implementation.

Separately, the duplicate implementations of internal iteration (i.e.
`Query::par_for_each`) now use portions of the current `Query::for_each`
implementation factored out into their own functions.

This also massively cleans up our internal fragmentation of internal
iteration options, deduplicating the iteration code used in `for_each`
and `par_iter().for_each()`.

---

## Changelog
Changed: `Query::for_each`, `Query::for_each_mut`, `Query::for_each`,
and `Query::for_each_mut` have been moved to `QueryIter`'s
`Iterator::for_each` implementation, and still retains their performance
improvements over normal iteration. These APIs are deprecated in 0.13
and will be removed in 0.14.

---------

Co-authored-by: JoJoJet <21144246+JoJoJet@users.noreply.github.com>
Co-authored-by: Alice Cecile <alice.i.cecile@gmail.com>
github-merge-queue bot pushed a commit to bevyengine/bevy that referenced this issue Dec 1, 2023
… Iterator combinators (#6773)

# Objective
After #6547, `Query::for_each` has been capable of automatic
vectorization on certain queries, which is seeing a notable (>50% CPU
time improvements) for iteration. However, `Query::for_each` isn't
idiomatic Rust, and lacks the flexibility of iterator combinators.

Ideally, `Query::iter` and friends should be able to achieve the same
results. However, this does seem to blocked upstream
(rust-lang/rust#104914) by Rust's loop optimizations.

## Solution
This is an intermediate solution and refactor. This moves the
`Query::for_each` implementation onto the `Iterator::fold`
implementation for `QueryIter` instead. This should result in the same
automatic vectorization optimization on all `Iterator` functions that
internally use fold, including `Iterator::for_each`, `Iterator::count`,
etc.

With this, it should close the gap between the two completely.
Internally, this PR changes `Query::for_each` to use
`query.iter().for_each(..)` instead of the duplicated implementation.

Separately, the duplicate implementations of internal iteration (i.e.
`Query::par_for_each`) now use portions of the current `Query::for_each`
implementation factored out into their own functions.

This also massively cleans up our internal fragmentation of internal
iteration options, deduplicating the iteration code used in `for_each`
and `par_iter().for_each()`.

---

## Changelog
Changed: `Query::for_each`, `Query::for_each_mut`, `Query::for_each`,
and `Query::for_each_mut` have been moved to `QueryIter`'s
`Iterator::for_each` implementation, and still retains their performance
improvements over normal iteration. These APIs are deprecated in 0.13
and will be removed in 0.14.

---------

Co-authored-by: JoJoJet <21144246+JoJoJet@users.noreply.github.com>
Co-authored-by: Alice Cecile <alice.i.cecile@gmail.com>
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
A-codegen Area: Code generation C-bug Category: This is a bug. I-slow Issue: Problems and improvements with respect to performance of generated code.
Projects
None yet
Development

No branches or pull requests

4 participants