
Return values up to 128 bits in registers #76986

Merged · 2 commits · Sep 27, 2020

Conversation

@jonas-schievink (Contributor) commented Sep 20, 2020

This fixes #26494 (comment) by making Rust's default ABI pass return values up to 128 bits in size in registers, just like the System V ABI.

The result is that these functions from the comment linked above now generate identical code, making the Rust ABI as efficient as the "C" ABI:

pub struct Stats { x: u32, y: u32, z: u32, }

pub extern "C" fn sum_c(a: &Stats, b: &Stats) -> Stats {
    return Stats {x: a.x + b.x, y: a.y + b.y, z: a.z + b.z };
}

pub fn sum_rust(a: &Stats, b: &Stats) -> Stats {
    return Stats {x: a.x + b.x, y: a.y + b.y, z: a.z + b.z };
}
Both now compile to:

sum_rust:
	movl	(%rsi), %eax
	addl	(%rdi), %eax
	movl	4(%rsi), %ecx
	addl	4(%rdi), %ecx
	movl	8(%rsi), %edx
	addl	8(%rdi), %edx
	shlq	$32, %rcx
	orq	%rcx, %rax
	retq

@rust-highfive (Collaborator) commented:

r? @estebank

(rust_highfive has picked a reviewer for you, use r? to override)

@rust-highfive added the S-waiting-on-review label (Status: Awaiting review from the assignee but also interested parties.) on Sep 20, 2020
@jonas-schievink (Contributor, Author) commented:

r? @nagisa perhaps? cc @eddyb

@rust-highfive assigned nagisa and unassigned estebank on Sep 20, 2020
@Mark-Simulacrum (Member) commented:

Is there a reason not to do this on all x86_64 targets?

@bors try @rust-timer queue

@rust-timer (Collaborator) commented:

Awaiting bors try build completion

@bors (Contributor) commented Sep 20, 2020

⌛ Trying commit 1be289d9c76441135b72186b133e949411657024 with merge 90ed0d04e2f640ecd65d3b39ffd37f7fd9b3cd25...

@bors (Contributor) commented Sep 20, 2020

☀️ Try build successful - checks-actions, checks-azure
Build commit: 90ed0d04e2f640ecd65d3b39ffd37f7fd9b3cd25

@rust-timer (Collaborator) commented:

Queued 90ed0d04e2f640ecd65d3b39ffd37f7fd9b3cd25 with parent 1fd5b9d, future comparison URL.

@rust-timer (Collaborator) commented:

Finished benchmarking try commit (90ed0d04e2f640ecd65d3b39ffd37f7fd9b3cd25): comparison url.

Benchmarking this pull request likely means that it is perf-sensitive, so we're automatically marking it as not fit for rolling up. Please note that if the perf results are neutral, you should likely undo the rollup=never given below by specifying rollup- to bors.

Importantly, though, if the results of this run are non-neutral, do not roll this PR up -- it will mask other regressions or improvements in the rollup.

@bors rollup=never

@jonas-schievink (Contributor, Author) commented:

Hey, that's neat: the improvements are in the compiler itself, so this seems to have a practical impact! It looks like most regressions are in LLVM, which is somewhat expected, since it might deal with the code differently now.

@calebsander (Contributor) commented:

pub struct Stats { x: u32, y: u32, z: u32, }

pub extern "C" fn sum_c(a: &Stats, b: &Stats) -> Stats {
    return Stats {x: a.x + b.x, y: a.y + b.y, z: a.z + b.z };
}
example::sum_c:
        mov     rax, qword ptr [rsi]
        add     rax, qword ptr [rdi]
        mov     edx, dword ptr [rsi + 8]
        add     edx, dword ptr [rdi + 8]
        mov     cl, byte ptr [rsi + 12]
        add     cl, byte ptr [rdi + 12]
        movzx   ecx, cl
        shl     rcx, 32
        or      rdx, rcx
        ret

The generated assembly here doesn't seem to match the code. It looks like you've used struct Stats { x: u64, y: u32, z: u8 } instead? The assembly I get (using the posted code) is:

example::sum_c:
        mov     eax, dword ptr [rsi]
        add     eax, dword ptr [rdi]
        mov     ecx, dword ptr [rsi + 4]
        add     ecx, dword ptr [rdi + 4]
        mov     edx, dword ptr [rsi + 8]
        add     edx, dword ptr [rdi + 8]
        shl     rcx, 32
        or      rax, rcx
        ret

@jonas-schievink (Contributor, Author) commented:

@calebsander ah, sorry, my mistake. Updated the snippet.

@pftbest (Contributor) commented Sep 21, 2020

Why not make this change for all 64-bit platforms? All major ones do this in C: https://godbolt.org/z/ds1ezh

@jonas-schievink (Contributor, Author) commented:

Because I only know x86_64.

@nagisa (Member) commented Sep 26, 2020

Can you add a codegen test for this that verifies we don't materialize the return value into stack/memory?

r=me after that.
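
For reference, a codegen test of that shape might look roughly like the sketch below. It follows the usual rustc codegen-test conventions (compiletest headers plus FileCheck CHECK lines over the emitted LLVM IR), but the function name and exact patterns here are illustrative assumptions, not the test that actually landed:

// compile-flags: -O
// only-x86_64

#![crate_type = "lib"]

pub struct Stats {
    pub x: u32,
    pub y: u32,
    pub z: u32,
}

// The return value should stay in registers: the function must not take a
// hidden `sret` out-pointer argument in the emitted LLVM IR.
// CHECK: define{{.*}}@sum(
// CHECK-NOT: sret
#[no_mangle]
pub fn sum(a: &Stats, b: &Stats) -> Stats {
    Stats { x: a.x + b.x, y: a.y + b.y, z: a.z + b.z }
}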

@jonas-schievink (Contributor, Author) commented:

@bors r=nagisa

@bors (Contributor) commented Sep 26, 2020

📌 Commit cc2ba3b has been approved by nagisa

@bors added the S-waiting-on-bors label (Status: Waiting on bors to run and complete tests. Bors will change the label on completion.) and removed the S-waiting-on-review label (Status: Awaiting review from the assignee but also interested parties.) on Sep 26, 2020
@bors (Contributor) commented Sep 27, 2020

⌛ Testing commit cc2ba3b with merge 62fe055...

@bors (Contributor) commented Sep 27, 2020

☀️ Test successful - checks-actions, checks-azure
Approved by: nagisa
Pushing 62fe055 to master...

@bors added the merged-by-bors label (This PR was explicitly merged by bors.) on Sep 27, 2020
@bors merged commit 62fe055 into rust-lang:master on Sep 27, 2020
@rustbot added this to the 1.48.0 milestone on Sep 27, 2020
@ecstatic-morse (Contributor) commented:

The final perf results for this PR are in. Instruction counts have increased on most benchmarks, and task-clock shows no improvement.

This is a bit disappointing, as I expected this PR to be a pretty clear win. The try run also showed small losses across the board (never trust the emoji), although stress tests of the trait resolution code (keccak and inflate) seemed to fare better than they did in the final run. Possibly the change in #77041 caused the discrepancy? @jonas-schievink Am I missing anything? If not, I think we should consider reverting this unless there are benefits outside of perf.

@jonas-schievink (Contributor, Author) commented:

Oh, that's disappointing. But this PR was mostly aimed at improving the generated code, not speeding up rustc.

Seems likely that #77041 has resulted in the same improvements I saw here.

@pftbest (Contributor) commented Sep 29, 2020

It looks to me like the only heavy operation here is a string comparison for spec.target_spec().arch.as_str().

Can it be removed somehow? For example, replace it with a check for pointer size == 64.
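
To make the suggestion concrete, a hypothetical variant of the max_ret_by_val helper added by this PR (quoted in the review thread below) could key off the pointer width instead of the architecture string. This is only a sketch of that idea, reusing the Size, HasDataLayout, and HasTargetSpec items from the surrounding compiler code, not code from the PR:

// Hypothetical rewrite: decide by pointer width rather than by comparing the
// architecture name string.
pub fn max_ret_by_val<C: HasTargetSpec + HasDataLayout>(spec: &C) -> Size {
    let ptr_size = spec.data_layout().pointer_size;
    if ptr_size.bits() == 64 {
        // Up to two 64-bit registers' worth of return value.
        Size::from_bits(128)
    } else {
        ptr_size
    }
}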

@jonas-schievink (Contributor, Author) commented:

The regressions are in LLVM from what I can tell, not in the code I added.

@pftbest (Contributor) commented Sep 29, 2020

Oh I see, that's too bad.

@ecstatic-morse (Contributor) commented:

Ah that's true. It seems like rustc just doesn't benefit from this very much at runtime. Here's the task-clock graph starting from the parent of this commit. Obviously there's a lot of noise here, but check builds remained about the same while codegen-ed builds regressed slightly. That said, I would guess there's code in the ecosystem that would benefit from this, and the compile-time regressions are pretty small.

Comment on lines +614 to +624
/// Returns the maximum size of return values to be passed by value in the Rust ABI.
///
/// Return values beyond this size will use an implicit out-pointer instead.
pub fn max_ret_by_val<C: HasTargetSpec + HasDataLayout>(spec: &C) -> Size {
    match spec.target_spec().arch.as_str() {
        // System-V will pass return values up to 128 bits in RAX/RDX.
        "x86_64" => Size::from_bits(128),

        _ => spec.data_layout().pointer_size,
    }
}
A Member commented:

Is there anything wrong with 2 * pointer_size? IIRC we already return pairs in two registers.

A Member commented:

That is, we already do 2 * pointer_size on all architectures, just not for arbitrary data, only pairs of scalars.

@jonas-schievink (Contributor, Author) commented:

Yeah, that makes sense. Opened #77434.
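
For illustration, the 2 * pointer_size variant discussed above might look roughly like the sketch below; the actual change went in via #77434 and may differ in detail:

// Sketch: allow return values up to two pointer-widths by value on every
// target, instead of special-casing x86_64.
pub fn max_ret_by_val<C: HasTargetSpec + HasDataLayout>(spec: &C) -> Size {
    Size::from_bits(spec.data_layout().pointer_size.bits() * 2)
}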

@jonas-schievink deleted the ret-in-reg branch on October 1, 2020, 22:32
bors added a commit to rust-lang-ci/rust that referenced this pull request Oct 3, 2020
…-boogalo, r=nagisa

Returns values up to 2*usize by value

Addresses rust-lang#76986 (comment) and rust-lang#76986 (comment) by doing the optimization on all targets.

This matches what we do for functions returning `&[T]` and other fat pointers, so it should be Harmless™
Labels: merged-by-bors (This PR was explicitly merged by bors.), S-waiting-on-bors (Status: Waiting on bors to run and complete tests. Bors will change the label on completion.)

Successfully merging this pull request may close these issues:

Rust should use registers more aggressively