YJIT: Stack temp register allocation for arm64 #7659
Conversation
Well, the inline code size seems to have grown, which is the opposite of x86. It's also not what I would have expected: x86 can operate directly on memory operands, whereas arm can't. You would expect the code size savings to be bigger on arm, because there are loads we have to insert on arm when we refer to memory operands, which we would not need anymore. Maybe it's worth doing a bit more investigation and looking at the resulting machine code, to see if there are redundant moves and memory operations, or some kind of bug?
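The expectation described above can be sketched with a small cost model (hypothetical, not YJIT's actual backend code; all names are made up): an x86_64 ALU instruction can take one memory operand directly, while arm64 must materialize every memory operand with a separate load, so moving a stack temp into a register should remove whole instructions on arm64.

```rust
// Hypothetical lowering model, not YJIT code: count the machine
// instructions needed for a binary ALU op (e.g. `dst = a + b`) on
// each architecture, depending on where the operands live.
#[derive(Clone, Copy, PartialEq)]
enum Opnd {
    Reg,
    Mem, // stack temp living on the VM stack
}

fn insn_count_x86(a: Opnd, b: Opnd) -> usize {
    // x86_64 ALU ops accept one memory operand directly, so an extra
    // `mov` is only needed when both operands are in memory.
    let extra_loads = if a == Opnd::Mem && b == Opnd::Mem { 1 } else { 0 };
    1 + extra_loads
}

fn insn_count_arm64(a: Opnd, b: Opnd) -> usize {
    // arm64 ALU ops take only register operands: every memory operand
    // costs a separate load (`ldur`) before the ALU instruction.
    let loads = [a, b].iter().filter(|&&o| o == Opnd::Mem).count();
    1 + loads
}

fn main() {
    // Both operands in memory: arm64 pays for two loads, x86_64 for
    // one mov, so register allocation should help arm64 more.
    assert_eq!(insn_count_x86(Opnd::Mem, Opnd::Mem), 2);
    assert_eq!(insn_count_arm64(Opnd::Mem, Opnd::Mem), 3);
    // Both operands in registers: the two architectures are even.
    assert_eq!(insn_count_x86(Opnd::Reg, Opnd::Reg), 1);
    assert_eq!(insn_count_arm64(Opnd::Reg, Opnd::Reg), 1);
}
```

Under this model, register-allocating both operands saves two instructions on arm64 but only one on x86_64, which is why the grown inline code size is surprising.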
Co-authored-by: Maxime Chevalier-Boisvert <maximechevalierb@gmail.com>
When we're looking at stats on railsbench, I guess we have to account for the fact that the behavior of railsbench is not that deterministic compared to other, smaller benchmarks. Here are the stats of a run with
and here are the stats of a run with
so I also looked at

On railsbench, I looked at some code with registers assigned to stack temps. This is an example scenario with smaller code:

--yjit-temp-regs=0: 92 bytes

# Block: <=>@/opt/rubies/ruby/lib/ruby/3.3.0+0/rubygems/version.rb:373 (ISEQ offset: 150, chain_depth: 1)
# Insn: opt_eq (stack_size: 2)
# guard arg0 fixnum
0x107d3d86c: ldur x11, [x21]
0x107d3d870: tst x11, #1
0x107d3d874: b.eq #0x107d3f318
0x107d3d878: nop
0x107d3d87c: nop
0x107d3d880: nop
0x107d3d884: nop
0x107d3d888: nop
# guard arg1 fixnum
0x107d3d88c: ldur x11, [x21, #8]
0x107d3d890: tst x11, #1
0x107d3d894: b.eq #0x107d3f334
0x107d3d898: nop
0x107d3d89c: nop
0x107d3d8a0: nop
0x107d3d8a4: nop
0x107d3d8a8: nop
0x107d3d8ac: ldur x11, [x21]
0x107d3d8b0: ldur x12, [x21, #8]
0x107d3d8b4: cmp x11, x12
0x107d3d8b8: mov x11, #0x14
0x107d3d8bc: mov x12, #0
0x107d3d8c0: csel x11, x11, x12, eq
0x107d3d8c4: stur x11, [x21]

--yjit-temp-regs=5: 76 bytes

# Block: <=>@/opt/rubies/ruby/lib/ruby/3.3.0+0/rubygems/version.rb:373 (ISEQ offset: 150, chain_depth: 1)
# reg_temps: 00000000 -> 00000011
# Insn: opt_eq (stack_size: 2)
# guard arg0 fixnum
0x107edc120: tst x1, #1
0x107edc124: b.eq #0x107ede270
0x107edc128: nop
0x107edc12c: nop
0x107edc130: nop
0x107edc134: nop
0x107edc138: nop
# guard arg1 fixnum
0x107edc13c: tst x9, #1
0x107edc140: b.eq #0x107ede2ac
0x107edc144: nop
0x107edc148: nop
0x107edc14c: nop
0x107edc150: nop
0x107edc154: nop
0x107edc158: cmp x1, x9
0x107edc15c: mov x11, #0x14
0x107edc160: mov x12, #0
0x107edc164: csel x11, x11, x12, eq
0x107edc168: mov x1, x11

On the other hand, here's an example that's generated only with

# Module#===
# spill_temps: 00000011 -> 00000000
0x107edc1ec: stur x1, [x21]
0x107edc1f0: stur x9, [x21, #8]

Savings in load instructions could be offset by spills, especially on railsbench, which has many method calls and C func operations. Here's other stats on railsbench:
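The 16-byte difference between the two blocks above is exactly the four `ldur` loads of stack temps that register allocation removes (the final `stur` becomes a same-size `mov`), since every arm64 instruction is 4 bytes. A quick sanity check of that arithmetic:

```rust
// Check the code-size arithmetic from the disassembly above.
// Every arm64 instruction is a fixed 4 bytes.
const INSN_BYTES: usize = 4;

fn main() {
    // The --yjit-temp-regs=0 block: 2 guard sequences of 8 insns each
    // (ldur, tst, b.eq, 5 nops) plus 7 insns for the comparison.
    let before_insns = 8 + 8 + 7;
    // Register temps drop the two guard loads and the two operand
    // loads; everything else stays the same size.
    let ldurs_removed = 4;
    let after_insns = before_insns - ldurs_removed;
    assert_eq!(before_insns * INSN_BYTES, 92); // matches "92 bytes"
    assert_eq!(after_insns * INSN_BYTES, 76);  // matches "76 bytes"
}
```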
Hm. Is there nondeterministic behavior in railsbench? I tried to seed the random number generator. I see that
Generally speaking, we should try to remove sources of nondeterminism in the benchmarks (by seeding RNGs and not using RNGs that can't be seeded), and we should also eliminate sources of nondeterminism from YJIT if possible.
It's a bit disappointing when compared to x86. On arm it looks like we're just increasing complexity and compilation time for not much gain (so far). Are the

Otherwise, do you have some ideas for making this better? What are the next steps, in your mind?
I know that you already worked on seeding the random number generator, and I think my past change also made it more deterministic. However, from my experience debugging YJIT on railsbench, it never felt deterministic. One theory is that railsbench may be using threads somewhere and their interruptions are time-sliced. I haven't investigated what could be using threads, but method call counts (Shopify/yjit-bench#163) indicate that threads might be used and get interrupted (e.g.
At a glance, it does seem like the
While I've done a few x86_64 backend optimizations as a starter, I still haven't added backend optimizations specifically for stack temp register allocation, and I was thinking about trying that next. Spending more time on arm64 code might let me discover something new too. One idea was to avoid using a register when the register operand is used only once and then immediately spilled, hoping that we'll never have a block whose spill cost outweighs the speedup from a register. In addition to the instruction elimination discussed above, I also want to improve method calls. Currently, method calls always spill arguments before passing them, but I'd like to see if we can do it more efficiently. Once I run out of ideas for stack temp register allocation, I plan to look at allocating registers for local variables. Most spills added for stack temps should be necessary/useful for local variables too, so the complexity added by this effort will not be wasted if that shows more improvements.
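The "used only once and then immediately spilled" idea above could look roughly like this (a hypothetical heuristic with made-up names, not code from this PR): if a temp is never read between its definition and its spill point, putting it in a register only adds a store, so it might as well stay in memory.

```rust
// Hypothetical cost model for the heuristic described above, not
// YJIT's actual data structures: grant a stack temp a register only
// if it is read at least once before it would be spilled back.
#[derive(Clone, Copy, PartialEq)]
enum Event {
    Read,  // temp used as an operand
    Spill, // temp written back to the VM stack (e.g. before a call)
}

fn worth_a_register(events: &[Event]) -> bool {
    // Look at everything that happens before the first spill; if the
    // temp is never read in that window, the register would just be
    // written and immediately stored, which is pure overhead.
    events
        .iter()
        .take_while(|&&e| e != Event::Spill)
        .any(|&e| e == Event::Read)
}

fn main() {
    // Defined, then immediately spilled for a method call: skip the
    // register and write straight to memory.
    assert!(!worth_a_register(&[Event::Spill, Event::Read]));
    // Read twice before the spill: the register pays for itself.
    assert!(worth_a_register(&[Event::Read, Event::Read, Event::Spill]));
}
```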
I support this plan 👍 🙂 You've outlined many good ideas for improving stack temp allocation. I think it makes sense to try to improve it as much as possible before moving on to locals. Ideally we can get to the point where we get clear speedups on arm as well. Will merge this PR to make things easier.
On the branch #7671, |
After recent arm64 optimizations (#7671, #7726, #7744, #7745, #7747), I don't see obvious inefficiency in the code that we currently generate. The 1st itr performance was improved by #7748 too. I benchmarked the latest master on arm64 again. The performance on micro and other benchmarks seems comparable to x86_64's (#7651). It's still less impressive on headline benchmarks, but

Headline
Other
Micro
You did a good job on this project. Well done 👍 I'm surprised it made so much difference on x86 but so little on arm64. Maybe it's because Apple's arm64 chips have register renaming like newer AMD chips (they can cache memory locations in registers). One last thing that you could try, if you want to, is to give arm64 more registers, up to 7 or 8. It should be easier on arm64, since the platform has access to more registers?
Thank you 👍
I tried giving it 8 registers (Shopify@f70828a). It doesn't seem to make a significant difference. I guess spills on C calls and method calls happen too often.
Would it make sense to try to focus on avoiding spills for some simple cases then? For example, C function calls that we know will not allocate or raise?
I counted
C function calls that don't allocate or raise (
Yeah. Hmm, I guess I wonder if all the C calls are properly classified as allocating or not allocating (are we being conservative?) 🤔 And then of course, allocations are only problematic if they trigger GC, which can happen but is a relatively rare event. This goes back to lazy frame pushing ideas and such. Eventually we will probably have to go there, but maybe there are other low-hanging fruits we can look at in the meantime.
I used four categories above, but all the C function-related ones would need a similar solution, so there are really just Ruby methods and C function calls.

For Ruby methods, it'll be about sharing a set of registers across multiple frames. If a caller inlines a callee and the callee can spill registers for the caller, this spill can be avoided. Whether you skip frame push/pop or not, inlining seems like the only way around it, because each spill code needs to be

For C function calls, if you use caller-saved registers, you have to spill them whether the call allocates/raises or not. So the other options are callee-saved registers and immediates. I think it's unfortunately impossible to lazily spill callee-saved registers to the VM stack, because we don't know if a C function spills them to the native stack or just doesn't use them. For immediates, you don't need to spill them for GC marking (not sure if leaving an uninitialized/previous stack slot is 100% safe for GC or ObjectSpace though), but you still have to lazily spill them when a catch table interprets the frame.

For the above reasons, I feel like optimizing immediates is the only potential low-hanging fruit for optimizing stack temps further. At least we won't have to read them from memory even if they're already spilled. In addition to that, I guess it's worth trying to skip spilling immediates when the ISEQ doesn't catch exceptions.
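The case analysis above can be condensed into a small decision table (a model of the reasoning in this comment plus the non-allocating/non-raising idea from earlier, not YJIT code; all names are hypothetical): calls that can trigger GC or unwinding force a spill of heap references, while an immediate would only need spilling when a catch table could interpret the frame.

```rust
// Hypothetical decision table for "does this temp need to be spilled
// before this call?", modeling the discussion above. Not YJIT code.
enum Callee {
    RubyMethod,        // callee frames may be interpreted (catch tables)
    CFunc,             // may allocate or raise, so GC/unwinding can happen
    CFuncNoAllocRaise, // known not to allocate or raise
}

fn must_spill(callee: &Callee, temp_is_immediate: bool, iseq_has_catch: bool) -> bool {
    match callee {
        Callee::RubyMethod | Callee::CFunc => {
            // Heap references must be on the VM stack for GC marking.
            // Immediates don't need GC marking; they only need to be
            // visible when a catch table interprets this frame.
            !temp_is_immediate || iseq_has_catch
        }
        // No allocation and no raise: neither GC nor a catch table
        // can observe the VM stack during the call.
        Callee::CFuncNoAllocRaise => false,
    }
}

fn main() {
    // Heap object live across a C call that may allocate: spill.
    assert!(must_spill(&Callee::CFunc, false, false));
    // Immediate, and the ISEQ has no catch table: skippable.
    assert!(!must_spill(&Callee::CFunc, true, false));
    // Immediate, but the ISEQ catches exceptions: spill.
    assert!(must_spill(&Callee::RubyMethod, true, true));
}
```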
Actually, I thought this might be fine, because spilling a callee-saved register to the VM stack will always leave the actual value on the VM stack and/or the machine stack, both of which GC will mark. Because ObjectSpace may return it, it seems risky to write a random value used by a C function to the VM stack though. However, lazily spilling registers around a C function is not a low-hanging fruit anyway.
Following up #7651, this PR adds arm64 support for stack temp register allocation. Because benchmark results are generally better on both x86_64 and arm64, this PR also changes the default of --yjit-temp-regs to 5.

Benchmark
The speedups on M1 seem less dramatic compared to my Linux x86_64 environment (maybe because M1 is faster at memory access?). But benchmark results still seem generally better than not enabling it.

Code size
The following stats are measured on railsbench. Unlike linux-x86_64, inline_code_size seems to have increased as well (maybe due to the difference in encoding?). But the total code_region_size increase doesn't seem to be too bad.

Before (--yjit-temp-regs=0)
After (--yjit-temp-regs=5)