
Sub-optimal codegen for f32 tuples #32045

Closed
bsteinb opened this issue Mar 4, 2016 · 6 comments · Fixed by #40658
Labels
A-codegen Area: Code generation

Comments

@bsteinb

bsteinb commented Mar 4, 2016

Similar to the behavior observed in #32031, tuples of f32 and f64 seem to be passed to functions in GPRs.

The f32 tuple takes an especially large hit, since the two f32 values are passed inside a single 64-bit GPR and have to be extracted and packed back via shift and or instructions. Even with inlining turned on, this does not go away.

The f64 tuple is not as bad as the f32 tuple. Without inlining, it does some moves to and from the SIMD registers; with inlining turned on, the tuple is kept in a SIMD register and the loop is vectorized and unrolled.

EDIT: Forgot to link to the code example on playpen.
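Since the playpen link is missing here, the following is only a minimal sketch of the kind of code being discussed; the function name, loop, and exact shape are assumptions, not the original example.

```rust
// Hypothetical reproduction of the reported pattern (names and shapes assumed).
// Without inlining, the (f32, f32) tuple crosses the call boundary and, per the
// report, both halves end up packed into a single 64-bit GPR.
#[inline(never)]
fn add(a: (f32, f32), b: (f32, f32)) -> (f32, f32) {
    (a.0 + b.0, a.1 + b.1)
}

fn main() {
    let mut acc = (0.0f32, 0.0f32);
    // A simple accumulation loop like this is where the shift/or extraction shows up.
    for i in 0..1000 {
        acc = add(acc, (i as f32, i as f32));
    }
    println!("{:?}", acc);
}
```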

@Aatch
Contributor

Aatch commented Mar 6, 2016

The reason for this is that we pass composite types less than a word in size as integers. In the vast majority of cases this results in much better code generation, so this is an unfortunate edge case where that isn't true.

The only real thing I can see us doing here is passing two-field structs as two arguments, as long as both arguments fit into registers, which is something we should probably be doing anyway.
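At the source level, that suggestion roughly corresponds to the difference between the two hypothetical signatures below (illustration only, not the actual ABI code):

```rust
// Status quo described above: the pair is a small composite, so it is passed
// as a single integer and has to be unpacked with shifts and ors.
pub fn pair_as_composite(p: (f32, f32)) -> f32 {
    p.0 + p.1
}

// Aatch's suggestion, expressed in source form: pass the two fields as two
// separate scalar arguments, each of which can sit in its own register.
pub fn pair_as_two_args(x: f32, y: f32) -> f32 {
    x + y
}
```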

@bsteinb
Author

bsteinb commented Mar 6, 2016

Would it make sense to introduce a special case for composite types made up only of floating point types? They could be passed inside SIMD registers.

@dotdash
Contributor

dotdash commented Mar 7, 2016

Yes, this should get passed as a 2xf32 vector IIRC. Same thing that the C ABI does, basically.
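For comparison, a sketch of the C-ABI behavior being referenced, assuming an x86-64 SysV target (struct and function names are made up):

```rust
// On x86-64 SysV, a #[repr(C)] pair of f32 is a homogeneous float aggregate
// and is passed in a single SSE register rather than a GPR.
#[repr(C)]
#[derive(Clone, Copy)]
pub struct Pair {
    pub x: f32,
    pub y: f32,
}

// With the C calling convention, `p` arrives in an xmm register.
pub extern "C" fn sum_c(p: Pair) -> f32 {
    p.x + p.y
}

// With the default Rust ABI at the time of this issue, the equivalent tuple
// was instead packed into a 64-bit integer register.
pub fn sum_rust(p: (f32, f32)) -> f32 {
    p.0 + p.1
}
```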

@sanxiyn added the A-codegen Area: Code generation label Mar 7, 2016
@bsteinb
Author

bsteinb commented Apr 9, 2016

Hey, the situation here has changed slightly in the meantime. Without inlining, the compiler (nightly on playpen) still crams the two f32 into a single 64-bit GPR. With inlining, it sticks each f32 in its own GPR and, on each loop iteration, moves each of them to a SIMD register to perform the two additions and then moves them back into the GPRs, omitting the shift and or dance.

@dsprenkels
Contributor

dsprenkels commented Jun 18, 2016

I would like to take this issue, and see if I can fix it.

My proposal is to just make an exception for aggregate types containing only floating point types when casting to an integer type. I am, however, not sure what to do with heterogeneous aggregate types (containing both f32 and f64). I feel that it would be fine to leave them just as they are, preventing the shift-or dance. Modern processors have more than enough xmm registers anyway.
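A sketch of the distinction being drawn, with made-up function names:

```rust
// Homogeneous aggregate: both fields are f32. Under the proposal this would be
// exempted from the cast-to-integer path and could travel in SSE registers.
pub fn sum_homogeneous(p: (f32, f32)) -> f32 {
    p.0 + p.1
}

// Heterogeneous aggregate: mixes f32 and f64. The comment suggests leaving
// these on the existing path rather than special-casing them.
pub fn sum_heterogeneous(p: (f32, f64)) -> f64 {
    p.0 as f64 + p.1
}
```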

@dsprenkels
Contributor

I have now looked at this issue for some time. I think "Rust" functions could be adjusted to use the "C" ABI handling (code), which would fix the sub-optimal codegen.

This, however, has some side effects that I do not have the experience to fix.
It also breaks a lot of codegen test cases. In most of these cases I cannot tell whether the codegen has merely become better or equivalent, or whether a test case breaks for a legitimate reason.

cc @eddyb

bors added a commit that referenced this issue Apr 9, 2017
Use ty::layout for ABI computation instead of LLVM types.

This is the first step in creating a backend-agnostic library for computing call ABI details from signatures.
I wanted to open the PR *before* attempting to move `cabi_*` from trans to avoid rebase churn in #39999.
**EDIT**: As I suspected, #39999 needs this PR to fully work (see #39999 (comment)).

The first 3 commits add more APIs to `ty::layout` and replace non-ABI uses of `sizing_type_of`.
These APIs are probably usable by other backends, and miri too (cc @stoklund @solson).

The last commit rewrites `rustc_trans::cabi_*` to use `ty::layout` and new `rustc_trans::abi` APIs.
Also, during the process, a couple trivial bugs were identified and fixed:
* `msp430`, `nvptx`, `nvptx64`: type sizes *in bytes* were compared with `32` and `64`
* `x86` (`fastcall`): `f64` was incorrectly not treated the same way as `f32`

Although not urgent, this PR also uses the more general "homogeneous aggregate" logic to fix #32045.
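A hypothetical check in the style of rustc's codegen tests for the behavior this fixes; the FileCheck patterns below are assumptions about the expected IR, not the actual test added by the PR:

```rust
// compile-flags: -O
#![crate_type = "lib"]

#[no_mangle]
pub fn pair_sum(a: (f32, f32), b: (f32, f32)) -> (f32, f32) {
    // CHECK-LABEL: @pair_sum
    // Assumed expectation: no shift/or unpacking of an i64-packed pair.
    // CHECK-NOT: lshr
    // CHECK-NOT: shl
    (a.0 + b.0, a.1 + b.1)
}
```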