
Result that uses niches resulting in a final size of 16 bytes emits poor LLVM IR due to ABI #97540

Closed
asquared31415 opened this issue May 30, 2022 · 23 comments · Fixed by #121668
Labels
A-abi Area: Concerning the application binary interface (ABI). A-codegen Area: Code generation C-bug Category: This is a bug. I-slow Issue: Problems and improvements with respect to performance of generated code. P-medium Medium priority regression-from-stable-to-stable Performance or correctness regression from one stable version to another. T-compiler Relevant to the compiler team, which will review and decide on the PR/issue.

Comments

@asquared31415
Contributor

The following code:

pub fn bad() -> Result<u64, core::ptr::NonNull<()>> {
    Ok(0)
}

#[allow(improper_ctypes_definitions)]
pub extern "C" fn good() -> Result<u64, core::ptr::NonNull<()>> {
    Ok(0)
}

godbolt LLVM IR
godbolt ASM

Both functions should be identical, returning the 128-bit value in RAX and RDX per the sysv-64 ABI, the default for the environment godbolt uses.

Instead, when returning the value without specifying any ABI, the compiler chooses to return the value using an out pointer argument.

Note: This affects Result<usize, std::io::Error>, a particularly common type that underlies a significant portion of std::io, including Read and Write.

Meta

This has improved and regressed since 1.43.0, the first version I could find which would use 128 bits for the type.
All versions tested used the expected code generation for the extern "C" function.
All tests were done using https://godbolt.org with no target overrides, which in practice means an x86_64 Linux machine using the sysv-64 ABI.

..= 1.42.0: niche not used, cannot be compared
1.43.0 ..= 1.47.0: codegen like today, uses out ptr for return
1.48.0 ..= 1.60.0: returns using an LLVM i128, which produces the most desirable code on sysv-64
1.61.0 .. (current 2022-05-28 nightly): poor codegen that uses an out ptr for return

@rustbot label +regression-from-stable-to-stable +A-codegen

@asquared31415 asquared31415 added the C-bug Category: This is a bug. label May 30, 2022
@rustbot rustbot added A-codegen Area: Code generation regression-from-stable-to-stable Performance or correctness regression from one stable version to another. I-prioritize Issue: Indicates that prioritization has been requested for this issue. labels May 30, 2022
@thomcc thomcc added the I-slow Issue: Problems and improvements with respect to performance of generated code. label May 30, 2022
@thomcc
Member

thomcc commented May 30, 2022

To be clear, it's debatable if this is a regression (if it is, it's just a performance issue), since we make no guarantee about the ABI of extern "Rust".

That said, this likely applies to all 128-bit Result types, so it could have a non-negligible impact on overall performance.

@erikdesjardins
Contributor

This is #26494 / #77434, reverted by #85265 / #94570

@asquared31415
Contributor Author

asquared31415 commented May 30, 2022

This is interesting, because tuples and structs seem to have some exceptions specifically for two usize-sized elements. (u64, u64) and (u64, *mut ()) return by value, while (u32, u32, u32, u32), (u64, u32), and [u32; 4] (and all arrays) return via a pointer. Additionally, structs containing exactly two usize-sized values are returned by value (both with and without #[repr(C)]).

The Result niche usage is specifically inhibiting this optimization, since otherwise the returned data would look like a (u64, *mut ()) in the Result<u64, core::ptr::NonNull<()>> and std::io::Result<usize> cases.
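For reference, here is a small, self-contained sketch of the tuple/array contrast described above; the function names are made up, and the by-value vs out-pointer behavior is simply the one reported for the godbolt sysv-64 setup:

// Observed above to return by value in a register pair on sysv-64.
pub fn pair_u64_ptr() -> (u64, *mut ()) {
    (0, core::ptr::null_mut())
}

// Observed above to return through an out pointer, despite also being 16 bytes.
pub fn array_u32x4() -> [u32; 4] {
    [0; 4]
}

fn main() {
    let _ = (pair_u64_ptr(), array_u32x4());
}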

@nbdd0121
Contributor

> To be clear, it's debatable if this is a regression (if it is, it's just a performance issue), since we make no guarantee about the ABI of extern "Rust".

Performance regression is a regression. And I think this is a very bad one.

If I replace the Ok(0) with Ok(1), then it even starts loading from memory.

I don't think this is related to the niche though; Result<u64, core::num::NonZeroUsize> still has the good old codegen. Somehow Result<u64, core::ptr::NonNull<()>> ceased to be passed as a scalar pair, while Result<u64, core::num::NonZeroUsize> still is.

@asquared31415
Contributor Author

Ah, NonZeroUsize and NonZeroU64 don't reproduce it, but NonZeroU{32, 16, 8} do return via an out ptr.
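Putting these observations together, a minimal sketch of the three cases (the function names are made up; the register vs out-pointer behavior is as reported in this thread for the godbolt sysv-64 setup):

use core::num::{NonZeroU32, NonZeroUsize};
use core::ptr::NonNull;

// Reported to return in registers (scalar pair).
pub fn with_nonzero_usize() -> Result<u64, NonZeroUsize> {
    Ok(0)
}

// Reported to return via an out pointer.
pub fn with_nonzero_u32() -> Result<u64, NonZeroU32> {
    Ok(0)
}

// Reported to return via an out pointer.
pub fn with_nonnull() -> Result<u64, NonNull<()>> {
    Ok(0)
}

fn main() {
    let _ = (with_nonzero_usize(), with_nonzero_u32(), with_nonnull());
}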

@nbdd0121
Contributor

Okay, I located the reason. Abi::ScalarPair is only produced if all variants' data parts are scalars and have the same abi::Primitive. NonZeroUsize and NonZeroU64 are both Primitive::Int(Integer::I64, false), which matches that of u64, so the type is treated as a scalar pair. However, NonZeroU{32,16,8} have a different integer size, and NonNull has Primitive::Pointer, which never matches any integer, so ScalarPair is not produced and the type is treated as Aggregate.

Before #94570, something this small was passed in registers, so it was optimized to be the same as if it were passed as a scalar pair; it wasn't actually a scalar pair, just a small aggregate passed directly. #94570 forces it to be passed indirectly, causing the perf regression.

Given that the motivation of #94570 is to work around LLVM's inability to autovectorize things passed in registers, I think a fix would be to use some heuristics to determine whether a 2-usize type can be a target of autovectorization: if so, pass it indirectly; otherwise, pass it by value.

@Urgau
Member

Urgau commented May 30, 2022

@nbdd0121 Look at #93564, which was a more targeted fix. Maybe it could be reopened?

EDIT: Looking into it right now.

@nbdd0121
Contributor

Thanks for the pointer. That PR looks like a better approach to me. I think the compile-time perf regression in that PR comes from excessive homogeneous_aggregate calls when the type is smaller than or equal to pointer size (which includes almost all primitives!). I suspect that if you change the nesting of the ifs so that it's only called on types between 1 and 2 pointers in size, the perf regression should be mitigated.
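A self-contained sketch of the suggested restructuring; none of these names exist in rustc, they only illustrate the intended control flow of checking sizes before consulting homogeneous_aggregate:

// Hypothetical predicate: only fall through to the expensive
// homogeneous_aggregate query for types strictly larger than one pointer
// and at most two pointers in size.
fn worth_checking_homogeneous_aggregate(size: u64, pointer_size: u64) -> bool {
    size > pointer_size && size <= 2 * pointer_size
}

fn main() {
    // A u64 on a 64-bit target: the query can be skipped entirely.
    assert!(!worth_checking_homogeneous_aggregate(8, 8));
    // A 16-byte Result on a 64-bit target: run the query, then decide
    // between scalar-pair passing and indirect passing.
    assert!(worth_checking_homogeneous_aggregate(16, 8));
}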

@Urgau
Member

Urgau commented May 30, 2022

@nbdd0121 I've reopened #93564 as #97559. I've confirmed that with the PR the regression is fixed and hopefully the perf regression of the PR is also fixed.

@apiraino
Contributor

apiraino commented May 31, 2022

Assigning priority as discussed in the Zulip thread of the Prioritization Working Group.

@rustbot label -I-prioritize +P-high +T-compiler

@rustbot rustbot added P-high High priority T-compiler Relevant to the compiler team, which will review and decide on the PR/issue. and removed I-prioritize Issue: Indicates that prioritization has been requested for this issue. labels May 31, 2022
@workingjubilee
Contributor

workingjubilee commented Jul 1, 2022

This issue only dubiously shows a codegen regression (it is Not Great, to be clear, but it's not clear from this example, and my knowledge of how common x86 microarchitectures work, that it's Actively Awful). Even if it is, the example does not show a performance impact in practical cases, as it is still very few instructions and then a simple ret. This needs a benchmark showing what we are actually trying to improve in terms of runtime, or a comparison across multiple values rather than just the zero cases, or even just something showing that the impact is more significant than a xorps and a mov where a xor would do.

@workingjubilee
Contributor

An alternative would be experimenting more with this in the context of other code we expect it to live in, because LLVM eagerly inlining things can make problems go away, as cited here: #91447 (comment)

Small functions like this, in particular, will tend to be inlined by any backend, precisely because Rust functions are not made external by default. Thus they do not have to respect the System V ABI, or any ABI for that matter, and they can make things up as they go along. Constant folding becomes quite practical to do. So, many functions show seemingly terrible codegen in isolation that doesn't actually come up as a problem in practice. This matters much more when we're mostly touching how they do parameter passing, which is precisely what inlining can make disappear, which is why I bring it up.

@workingjubilee workingjubilee added P-medium Medium priority A-abi Area: Concerning the application binary interface (ABI). and removed P-high High priority I-slow Issue: Problems and improvements with respect to performance of generated code. labels Jul 1, 2022
@schreter

schreter commented Jul 7, 2022

BTW, in our project, we also noticed performance degradation because of this regression (small 16B Results returned via the stack instead of registers), so I vote for a quick fix. Ideally, all return values of up to 16B size should be returned in registers (at least on x64 and arm64 and at least for integer/pointer values, not sure about floats and the like). Thanks.

@workingjubilee
Contributor

> BTW, in our project, we also noticed performance degradation because of this regression (small 16B Results returned via the stack instead of registers), so I vote for a quick fix. Ideally, all return values of up to 16B size should be returned in registers (at least on x64 and arm64 and at least for integer/pointer values, not sure about floats and the like). Thanks.

I would sincerely appreciate it if this was quantified in some manner or if you could give a more complex example. It's not that I don't think you are correct; it's just that, unfortunately, the quickest fix is to simply revert the changes that regressed this outright, and reverting those changes would simply regress performance in other areas, where the losses were quite significant.

That's not acceptable, but this can still be fixed. We need to find something else as a solution, and more benchmarks, or even a nuanced array of codegen examples, make it less likely that we perform an overly sophomoric "fix" that makes things even worse for real code at the expense of improving academic cases. It will be fixed, just not with the quickest fix, except in the sense of making things faster.

Honestly, the question I have is why it isn't e.g. returned in a single xmm register instead. It doesn't make sense to me that LLVM would drop a value that can clearly already fit in a register on the stack, ever.

@nbdd0121
Contributor

It's rustc that makes the return value indirect, not LLVM.

@nikic
Contributor

nikic commented Jul 11, 2022

> Honestly, the question I have is why it isn't e.g. returned in a single xmm register instead. It doesn't make sense to me that LLVM would drop a value that can clearly already fit in a register on the stack, ever.

The ABI is decided by rustc here, not LLVM. If rustc tells LLVM to use an sret argument, then it must do so.

Returning an i64 pair in an XMM register is also a bad idea because it requires GPR to XMM moves, even if we ignore issues related to the target-feature dependence of XMM registers.

@schreter

> I would sincerely appreciate it if this was quantified in some manner or if you could give a more complex example.

@workingjubilee Yes, we need to quantify this, but it's not that trivial. Our project is fairly big, and the benchmark directly or indirectly touches 200K+ LOC of our code and likely millions of LOC from other crates, so it's not easy to quantify. I'd estimate the average performance loss at 5% across the board from 1.60 -> 1.62.

We do use quite a few functions returning Results in tight loops (fallible value matchers, pushing results to bitmap sinks). I can try to create a manageable repro case with this, if it would help.

@workingjubilee
Contributor

> We do use quite a few functions returning Results in tight loops (fallible value matchers, pushing results to bitmap sinks). I can try to create a manageable repro case with this, if it would help.

Yes, one of the smaller loops that is like what you use, if it can show an even somewhat similar regression (it may need generous application of std::hint::black_box), would be excellent.
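For concreteness, one rough shape such a microbenchmark could take; this is only a sketch (the function body, iteration count, and placement of black_box are made up), and as noted further down, black_box itself can distort the measurement:

use std::hint::black_box;
use std::ptr::NonNull;
use std::time::Instant;

#[inline(never)]
fn fallible(x: u64) -> Result<u64, NonNull<()>> {
    // An occasional error keeps the whole loop from being constant-folded.
    if x % 7 == 0 { Err(NonNull::dangling()) } else { Ok(x) }
}

fn main() {
    let start = Instant::now();
    let mut acc = 0u64;
    for i in 0..100_000_000u64 {
        // black_box on the input keeps the call from being hoisted or folded.
        if let Ok(v) = fallible(black_box(i)) {
            acc = acc.wrapping_add(v);
        }
    }
    println!("acc = {acc}, elapsed = {:?}", start.elapsed());
}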

> The ABI is decided by rustc here, not LLVM. If rustc tells LLVM to use an sret argument, then it must do so.

Hm. Honestly, I guess that after the amount of noodling around I have done with LLVM IR, and the way the LangRef phrases byval's documentation (implying it is only a polite suggestion), I didn't fully grok that LLVM really interprets sret as a true unconditional directive.

@schreter

@workingjubilee I've tried to implement something with black_box; unfortunately, Rust happily optimizes it out, so I have to find a different approach (haven't had much time yet, sorry).

What we also see, and what is probably an at least equally big problem, is the handling of larger Futures, where a lot of unnecessary memory copying is in play. After doing some Result-based optimizations, these memcpy() calls are the hottest function in the profile. What happens:

  • Calling an async function in fact calls a stub producing a Future (which is really a generator that is later polled). This Future is stored in the parent async function's "stack", which is simply an aggregated Future of the parent function and the called function (with some layout optimizations).
  • Unfortunately, instead of telling the Future generator to materialize the Future directly into the proper place, the Future is materialized on the regular stack and then copied from the stack into the caller's Future.
  • Now, we have a loop calling a request handler which supports various request types (dispatched by a match statement), one of which produces a large Future. The call to the request handler then materializes a Future by writing a few words to the stack, and this (practically empty) space is copied in full from the stack to the parent Future.

This wastes cycles like there is no tomorrow: instead of materializing the callee future on the stack, async functions should materialize the callee future directly in place in the caller's future. That would save the copying (and, especially, the copying of still-uninitialized space!).

I'm wondering if that has something to do with the Result issue as well or not. In any case, I couldn't find a specific issue for this problem in rust-lang issues. Are you aware of this problem already or not?

I'm definitely no expert in compiler construction, but if I understand it correctly, the LLVM sret attribute takes a location in which to materialize the return value. So it should be possible to specify a location inside the caller's future instead of the stack (and thus greatly optimize async function startup). If not, possibly there is a different LLVM feature that could be used instead (or the ABI could be modelled differently, by passing a hidden parameter indicating where to materialize the generator).
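A minimal, self-contained shape of the pattern being described; the buffer size, dispatch logic, and function names are all made up for illustration:

// A child future with a large buffer held across an await point, so the
// buffer becomes part of the future's state.
async fn handle_large_request() -> Result<(), u32> {
    let buf = [0u8; 16 * 1024];
    some_io(&buf).await;
    Ok(())
}

// The dispatcher's future embeds the child future; the concern above is that
// the child future is first built on the regular stack and then memcpy'd into
// this enclosing state instead of being constructed in place.
async fn dispatch(kind: u32) -> Result<(), u32> {
    match kind {
        0 => handle_large_request().await,
        _ => Ok(()),
    }
}

async fn some_io(_buf: &[u8]) {}

fn main() {
    // Creating the future does not require a runtime; its size already shows
    // that the large buffer lives inside the enclosing state.
    let fut = dispatch(0);
    println!("dispatch future size: {} bytes", std::mem::size_of_val(&fut));
}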

@thomcc
Member

thomcc commented Aug 2, 2022

Note that this also happens for Result<SomePointer, SomeInteger>. I saw some suggestion that single-pointer Error types (std::io::Error, anyhow::Error, ThinBox<dyn Error>, ...) could work around this bug by storing their pointers in a NonZeroUsize when not running under miri. Ignoring the fact that this is gross, it would just move the problem so that you get the bad codegen for cases where Ok is a pointer/ref/etc instead (stuff like Result<&T, anyhow::Error>). So, just so nobody else is confused, this does not seem to be a case where you can abandon your morals in the name of performance in order to work around this, you'll just hit it in different cases.
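To illustrate why that workaround would only relocate the problem, here is a hedged sketch; ThinErr is a made-up stand-in for a single-pointer error type that stashes its pointer in an integer:

use std::num::NonZeroUsize;

// Made-up thin error type that hides its pointer inside an integer.
pub struct ThinErr(pub NonZeroUsize);

// With an integer Ok payload, both halves are integers again, so this case
// would be helped by the trick...
pub fn int_ok(x: u64) -> Result<u64, ThinErr> {
    Ok(x)
}

// ...but a pointer/reference Ok payload reintroduces the pointer-vs-integer
// mismatch, so the bad codegen just moves here instead.
pub fn ref_ok(x: &u8) -> Result<&u8, ThinErr> {
    Ok(x)
}

fn main() {
    let v = 1u8;
    let _ = (int_ok(7), ref_ok(&v));
}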


Anyway, I do think it's kinda bad[1]. Most Rust code is very Result-heavy. I think quantifying this will be very hard, though -- for example, Rust code often looks like: https://godbolt.org/z/ETKfod3r5. The ptrs code is clearly worse than ints -- those reg/reg movs are free (on most modernish machines), but the stack ops are reads/writes[2] to memory. Sadly, I'm not sure this is something that will be easy to measure...

Notably, black_box in particular would be a problem if used too aggressively -- it currently basically forces stack spilling of the argument (see #99899), so you'd need to use it in such a way that this overhead didn't overshadow the ABI wonkiness. Constructing a benchmark out of something like the linked godbolt, but with more cases, might be an okay thing to measure, with some caveats like measuring min time rather than avg (though I'd suspect it would still kinda suck for all the normal reasons benchmarks suck[3]).

For codegen issues, something that models the CPU's resources like llvm-mca may be the best bet, but it tends to be terrible at things with calls (so, as always, real measurements > artificial benchmarks). For this case, because both versions have basically equivalent calls, it might be fine, and MCA does seem to suggest that the ints code will perform decidedly better than ptrs, at least on the CPU it's simulating. (But yeah, don't read too much into the exact difference, due to the calls.)

Footnotes

  1. Well, I'd expect this to be pretty small in terms of absolute impact; it just may be a small overhead applied to a fairly large amount of code.

  2. Well, the writes are implied, since they'd be in the callee.

  3. TBH, I've all but given up[4] on artificial benchmarks at this point unless carefully constructed. Processors are getting very smart at making running the same code in a loop repeatedly (the way you do in a benchmark) an inaccurate way to measure the time that code will normally take to execute.

  4. Which is to say I still use them, they just make me sad.

@workingjubilee workingjubilee added the I-slow Issue: Problems and improvements with respect to performance of generated code. label Aug 6, 2022
@workingjubilee
Contributor

Hmm. It kinda sucks that we don't use higher-level information from MIR or thereabouts to annotate types with "wants vectorization" and "doesn't want vectorization". MIR is about where we know which cases of (isize, isize) would prefer to go into a vector register and which would be just as happy being passed in two general registers.

bors added a commit to rust-lang-ci/rust that referenced this issue Feb 27, 2024
Represent `Result<usize, Box<T>>` as ScalarPair(i64, ptr)

This allows types like `Result<usize, std::io::Error>` (and integers of differing sign, e.g. `Result<u64, i64>`) to be passed in a pair of registers instead of through memory, like `Result<u64, u64>` or `Result<Box<T>, Box<U>>` are today.

Fixes rust-lang#97540.

r? `@ghost`
bors added a commit to rust-lang-ci/rust that referenced this issue Mar 6, 2024
Represent `Result<usize, Box<T>>` as ScalarPair(i64, ptr)
@riking

riking commented Mar 13, 2024

note about #121668 that confused me a couple times:

The integer part holds the discriminant, and the ptr holds the data in both cases.

@bors bors closed this as completed in 3cbb932 Mar 13, 2024
@josephlr
Contributor

josephlr commented Mar 21, 2024

I'm looking at #121668 and trying to figure out why something like Result<&[u8], usize> still doesn't get an ABI of ScalarPair while something like Result<&[u8], ()> does get a ScalarPair ABI. Both types have a size of 16 and an alignment of 8. Furthermore, both store either 0 or a pointer in the first 8 bytes and store a usize in the second 8 bytes.

I used #[rustc_layout(debug)] to get the layout of both types, and for Result<&[u8], ()> I get:

Layout (ABI is ScalarPair)

Layout {
    size: Size(16 bytes),
    align: AbiAndPrefAlign {
        abi: Align(8 bytes),
        pref: Align(8 bytes),
    },
    abi: ScalarPair(
        Initialized {
            value: Pointer(
                AddressSpace(
                    0,
                ),
            ),
            valid_range: (..=0) | (1..),
        },
        Union {
            value: Int(
                I64,
                false,
            ),
        },
    ),
    fields: Arbitrary {
        offsets: [
            Size(0 bytes),
        ],
        memory_index: [
            0,
        ],
    },
    largest_niche: None,
    variants: Multiple {
        tag: Initialized {
            value: Pointer(
                AddressSpace(
                    0,
                ),
            ),
            valid_range: (..=0) | (1..),
        },
        tag_encoding: Niche {
            untagged_variant: 0,
            niche_variants: 1..=1,
            niche_start: 0,
        },
        tag_field: 0,
        variants: [
            Layout {
                size: Size(16 bytes),
                align: AbiAndPrefAlign {
                    abi: Align(8 bytes),
                    pref: Align(8 bytes),
                },
                abi: ScalarPair(
                    Initialized {
                        value: Pointer(
                            AddressSpace(
                                0,
                            ),
                        ),
                        valid_range: 1..=18446744073709551615,
                    },
                    Initialized {
                        value: Int(
                            I64,
                            false,
                        ),
                        valid_range: 0..=18446744073709551615,
                    },
                ),
                fields: Arbitrary {
                    offsets: [
                        Size(0 bytes),
                    ],
                    memory_index: [
                        0,
                    ],
                },
                largest_niche: Some(
                    Niche {
                        offset: Size(0 bytes),
                        value: Pointer(
                            AddressSpace(
                                0,
                            ),
                        ),
                        valid_range: 1..=18446744073709551615,
                    },
                ),
                variants: Single {
                    index: 0,
                },
                max_repr_align: None,
                unadjusted_abi_align: Align(8 bytes),
            },
            Layout {
                size: Size(0 bytes),
                align: AbiAndPrefAlign {
                    abi: Align(1 bytes),
                    pref: Align(8 bytes),
                },
                abi: Aggregate {
                    sized: true,
                },
                fields: Arbitrary {
                    offsets: [
                        Size(0 bytes),
                    ],
                    memory_index: [
                        0,
                    ],
                },
                largest_niche: None,
                variants: Single {
                    index: 1,
                },
                max_repr_align: None,
                unadjusted_abi_align: Align(1 bytes),
            },
        ],
    },
    max_repr_align: None,
    unadjusted_abi_align: Align(8 bytes),
}

However, for Result<&[u8], usize> I get:

Layout (ABI is Aggregate)
Layout {
    size: Size(16 bytes),
    align: AbiAndPrefAlign {
        abi: Align(8 bytes),
        pref: Align(8 bytes),
    },
    abi: Aggregate {
        sized: true,
    },
    fields: Arbitrary {
        offsets: [
            Size(0 bytes),
        ],
        memory_index: [
            0,
        ],
    },
    largest_niche: None,
    variants: Multiple {
        tag: Initialized {
            value: Pointer(
                AddressSpace(
                    0,
                ),
            ),
            valid_range: (..=0) | (1..),
        },
        tag_encoding: Niche {
            untagged_variant: 0,
            niche_variants: 1..=1,
            niche_start: 0,
        },
        tag_field: 0,
        variants: [
            Layout {
                size: Size(16 bytes),
                align: AbiAndPrefAlign {
                    abi: Align(8 bytes),
                    pref: Align(8 bytes),
                },
                abi: ScalarPair(
                    Initialized {
                        value: Pointer(
                            AddressSpace(
                                0,
                            ),
                        ),
                        valid_range: 1..=18446744073709551615,
                    },
                    Initialized {
                        value: Int(
                            I64,
                            false,
                        ),
                        valid_range: 0..=18446744073709551615,
                    },
                ),
                fields: Arbitrary {
                    offsets: [
                        Size(0 bytes),
                    ],
                    memory_index: [
                        0,
                    ],
                },
                largest_niche: Some(
                    Niche {
                        offset: Size(0 bytes),
                        value: Pointer(
                            AddressSpace(
                                0,
                            ),
                        ),
                        valid_range: 1..=18446744073709551615,
                    },
                ),
                variants: Single {
                    index: 0,
                },
                max_repr_align: None,
                unadjusted_abi_align: Align(8 bytes),
            },
            Layout {
                size: Size(16 bytes),
                align: AbiAndPrefAlign {
                    abi: Align(8 bytes),
                    pref: Align(8 bytes),
                },
                abi: Aggregate {
                    sized: true,
                },
                fields: Arbitrary {
                    offsets: [
                        Size(8 bytes),
                    ],
                    memory_index: [
                        0,
                    ],
                },
                largest_niche: None,
                variants: Single {
                    index: 1,
                },
                max_repr_align: None,
                unadjusted_abi_align: Align(8 bytes),
            },
        ],
    },
    max_repr_align: None,
    unadjusted_abi_align: Align(8 bytes),
}
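For reference, the dumps above can be reproduced on nightly with something like the following sketch; the alias names are made up, and the attribute reports each layout as a compiler diagnostic rather than producing a runnable program:

#![feature(rustc_attrs)]
#![allow(dead_code)]

#[rustc_layout(debug)]
type WithUnitErr = Result<&'static [u8], ()>; // reported with a ScalarPair ABI

#[rustc_layout(debug)]
type WithUsizeErr = Result<&'static [u8], usize>; // reported with an Aggregate ABI

fn main() {}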

The main difference between these two types (other than being assigned different ABIs) is the layout.fields.offsets of the variants. Specifically:

  • Result<&[u8], ()>
    • Ok variant has offset Size(0 bytes)
    • Err variant has offset Size(0 bytes)
    • Gets ScalarPair ABI
  • Result<&[u8], usize>
    • Ok variant has offset Size(0 bytes)
    • Err variant has offset Size(8 bytes)
    • Gets Aggregate ABI
  • Result<usize, usize>
    • Ok variant has offset Size(8 bytes)
    • Err variant has offset Size(8 bytes)
    • Gets ScalarPair ABI

Does anyone know if it would be possible for Result<&[u8], usize> to have a ScalarPair ABI? If so, how would layout_of_enum need to be changed to support that? I figure it's somewhere in this for loop:

for (field_layouts, layout_variant) in iter::zip(variants, &layout_variants) {

but I'm not sure exactly how to make the change.

EDIT: It seems like that for loop is only used for the tagged_layout, but the niche_filling_layout is computed separately. So it seems like layout_of_enum would need a more substantial change to use the ScalarPair. Specifically, we would need to:
