Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Attempt to improve codegen for arrays of repeated enums #104384

Closed
wants to merge 2 commits into from

Conversation

clubby789
Copy link
Contributor

@clubby789 clubby789 commented Nov 13, 2022

Fixes #101685

For enums where the tag and value are the same type, we can get better codegen for [Enum; N] by emitting a vector store instruction.

1024 as a limit was picked randomly and is probably too low - I'm unsure how to construct a vector of alternating elements without it being O(n)

@rustbot
Copy link
Collaborator

rustbot commented Nov 13, 2022

r? @wesleywiser

(rustbot has picked a reviewer for you, use r? to override)

@rustbot rustbot added S-waiting-on-review Status: Awaiting review from the assignee but also interested parties. T-compiler Relevant to the compiler team, which will review and decide on the PR/issue. labels Nov 13, 2022
@clubby789 clubby789 marked this pull request as ready for review November 14, 2022 00:19
@clubby789 clubby789 changed the title WIP: Attempt to improve codegen for arrays of repeated enums Attempt to improve codegen for arrays of repeated enums Nov 14, 2022
pub fn some_repeat() -> [Option<u8>; 64] {
// CHECK: store <128 x i8>
// CHECK-NEXT: ret void
[Some(0); 64]
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It's interesting to me that some_repeat seems to do 16 bytes at a time on x64, but none_repeat doesn't. Might be interesting to look at why LLVM is treating them differently.

Also, out of curiosity, what's the assembly difference before/after this PR?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

godbolt - This is none_repeat before and after LLVM opts - it looks like LLVM is optimising out the undefined write and forgetting about it. This happened even if I explicitly added store undef.

The current assembly output is

  • None; 64 - 64 single byte movs to the tag
  • Some(0); 64 - A 16 byte constant vector (alternating 0/1) is stored into the array with movups
  • Some(1); 64 - The same as Some(0), but with an all-1's pattern

With this patch

  • None; 64 - 16 byte vector created with xorps, then movups
  • Some(0); 64/Some(1); 64 - Same as before

@scottmcm
Copy link
Member

I'm unsure how to construct a vector of alternating elements without it being O(n)

No idea if this is a good way, but the way that jumps to mind is to shufflevector together shorter ones double the length each time. Though that's still O(n) in the mask vector, so maybe using memory operations to do it would be better.

That said, the other thing that comes to mind is that it might be easier to lower the [x; N] to a call to some lang-item core::array::repeat::<_, N>(x), so these optimizations could be done in Rust code instead of codegen. Then other backends would get them too, and it's not obvious to me that making 1024-element vectors is the best way here. Is more than 64 bytes at once ever better, on current machines? In rust code it could, for example, loop over as_chunks_mut to write chunks at a time and let LLVM then optimize that.

let vec = unsafe { llvm::LLVMGetUndef(self.type_vector(ty, count as u64)) };
let vec = (0..count as usize).fold(vec, |acc, x| {
let elt = [v1, v2][x % 2];
self.insert_element(acc, elt, self.cx.const_i32(x as i32))
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

No idea if this is better, but you could do it in O(1) LLVM instructions by making a <2 x _> (with two insert_elements) and then repeating that one with a shuffle like

shufflevector <2 x i8> %x, <2 x i8> undef, <64 x i32> <i32 0, i32 1, i32 0, i32 1, i32 0, i32 1, i32 0, i32 1, i32 0, i32 1, i32 0, i32 1, i32 0, i32 1, i32 0, i32 1, i32 0, i32 1, i32 0, i32 1, i32 0, i32 1, i32 0, i32 1, i32 0, i32 1, i32 0, i32 1, i32 0, i32 1, i32 0, i32 1, i32 0, i32 1, i32 0, i32 1, i32 0, i32 1, i32 0, i32 1, i32 0, i32 1, i32 0, i32 1, i32 0, i32 1, i32 0, i32 1, i32 0, i32 1, i32 0, i32 1, i32 0, i32 1, i32 0, i32 1, i32 0, i32 1, i32 0, i32 1, i32 0, i32 1, i32 0, i32 1>

https://llvm.godbolt.org/z/5feKaEo5z

@clubby789
Copy link
Contributor Author

I've implemented it using shufflevector now, but I've noticed that this means that cases like Some(1); 1024 (which used to be a memset) are now a long series of vector stores. This seems like something LLVM should be handling but isn't - I'll see if I can get better emitted code.

@rustbot author

@rustbot rustbot added S-waiting-on-author Status: This is awaiting some action (such as code changes or more information) from the author. and removed S-waiting-on-review Status: Awaiting review from the assignee but also interested parties. labels Nov 14, 2022
@clubby789
Copy link
Contributor Author

My original approach worked at the rustc_codegen_ssa level and did a memset with the tag value - that improved the None case, but the Some(1) case got much worse. Something along those lines might help

@bors
Copy link
Contributor

bors commented Nov 17, 2022

☔ The latest upstream changes (presumably #103138) made this pull request unmergeable. Please resolve the merge conflicts.

@clubby789
Copy link
Contributor Author

I'm going to close this since it only targets LLVM, is a hacky way of solving the problem, and actually causes regressions 😅

@clubby789 clubby789 closed this Nov 17, 2022
@clubby789 clubby789 deleted the improve-enum-array branch February 11, 2023 14:45
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
S-waiting-on-author Status: This is awaiting some action (such as code changes or more information) from the author. T-compiler Relevant to the compiler team, which will review and decide on the PR/issue.
Projects
None yet
Development

Successfully merging this pull request may close these issues.

Repeating an enum does not generate a memset
5 participants