
YJIT: Stack temp register allocation for arm64 #7659

Merged
merged 4 commits into ruby:master on Apr 6, 2023

Conversation

@k0kubun (Member) commented Apr 4, 2023

Following up on #7651, this PR adds arm64 support for stack temp register allocation. Because benchmark results are generally better on both x86_64 and arm64, this PR also changes the default of --yjit-temp-regs to 5.

Benchmark

The speedups on M1 seem less dramatic than in my Linux x86_64 environment (maybe because M1 is faster at memory access?). Still, the benchmark results are generally better than with register allocation disabled.

regs=0: ruby 3.3.0dev (2023-04-04T23:36:32Z yjit-stack-arm64 ca7446cd37) +YJIT [arm64-darwin22]
regs=5: ruby 3.3.0dev (2023-04-04T23:36:32Z yjit-stack-arm64 ca7446cd37) +YJIT [arm64-darwin22]

--------------  -----------  ----------  -----------  ----------  -------------  --------------
bench           regs=0 (ms)  stddev (%)  regs=5 (ms)  stddev (%)  regs=0/regs=5  regs=5 1st itr
activerecord    17.7         2.3         17.5         2.2         1.01           0.89
erubi_rails     6.1          10.6        6.1          9.8         1.00           0.65
hexapdf         922.7        1.9         916.6        2.1         1.01           0.97
liquid-c        24.5         3.6         24.4         3.6         1.01           0.82
liquid-render   49.0         3.3         48.4         3.6         1.01           0.86
mail            57.1         2.2         55.3         1.7         1.03           0.87
psych-load      888.2        0.4         880.4        0.5         1.01           1.01
railsbench      646.9        1.6         647.6        1.7         1.00           0.92
ruby-lsp        32.0         26.4        32.2         34.7        1.00           1.02
sequel          30.3         1.6         30.4         1.4         0.99           1.00
binarytrees     112.4        2.6         110.2        2.6         1.02           1.02
chunky_png      370.1        0.3         362.7        0.4         1.02           1.00
erubi           121.4        1.5         121.8        2.1         1.00           1.00
etanni          210.6        0.9         207.3        0.9         1.02           1.02
fannkuchredux   446.0        0.4         401.5        0.2         1.11           1.00
lee             482.6        0.7         474.6        0.9         1.02           1.48
nbody           41.3         0.7         42.3         0.7         0.98           0.96
optcarrot       1434.6       0.6         1364.9       0.6         1.05           1.03
ruby-json       1496.1       0.4         1506.0       0.4         0.99           0.99
rubykon         3413.4       0.4         3196.5       0.4         1.07           1.03
30k_ifelse      350.3        0.8         368.8        2.2         0.95           0.61
30k_methods     837.5        0.6         836.0        0.9         1.00           0.88
cfunc_itself    20.9         0.9         20.6         1.4         1.02           1.03
fib             34.4         0.6         32.4         1.0         1.06           1.07
getivar         40.3         51.7        27.9         71.4        1.44           1.00
keyword_args    30.4         0.8         27.4         0.9         1.11           1.12
respond_to      19.2         0.9         18.6         0.8         1.03           1.02
setivar         10.4         78.1        8.2          91.6        1.27           1.00
setivar_object  28.7         48.4        27.6         50.8        1.04           0.99
setivar_young   28.8         49.2        27.5         50.0        1.05           1.02
str_concat      23.0         2.9         22.3         3.2         1.03           1.01
throw           11.9         0.9         11.9         1.0         1.00           1.00
--------------  -----------  ----------  -----------  ----------  -------------  --------------

Code size

The following stats are measured on railsbench. Unlike linux-x86_64, inline_code_size seems to have increased as well (maybe due to the difference in instruction encoding?). But the total code_region_size increase doesn't seem too bad.

Before (--yjit-temp-regs=0)

inline_code_size:          4,238,900
outlined_code_size:        2,153,884
code_region_size:          7,667,712

After (--yjit-temp-regs=5)

inline_code_size:          4,338,116
outlined_code_size:        3,075,936
code_region_size:          7,684,096
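
To put the code_region_size growth in relative terms, a quick back-of-the-envelope check using the numbers from the stats above:

```ruby
# code_region_size from the railsbench stats above
before = 7_667_712 # --yjit-temp-regs=0
after  = 7_684_096 # --yjit-temp-regs=5
growth_pct = ((after - before) * 100.0 / before).round(2)
# growth_pct => 0.21, i.e. roughly a 0.2% increase in total code size
```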

@k0kubun k0kubun force-pushed the yjit-stack-arm64 branch 4 times, most recently from c45d205 to f41edb0, on April 5, 2023 16:38
@k0kubun k0kubun marked this pull request as ready for review April 5, 2023 17:54
@matzbot matzbot requested a review from a team April 5, 2023 17:54
@maximecb (Contributor) commented Apr 5, 2023

The speedups on M1 seem less dramatic than in my Linux x86_64 environment (maybe because M1 is faster at memory access?). Still, the benchmark results are generally better than with register allocation disabled.

Well, the inline code size seems to have grown, which is the opposite of x86.

It's also not what I would have expected. x86 can directly operate on memory operands, whereas arm can't. You would expect that, on arm, the code size savings would be bigger, because there are some loads that we have to insert on arm when we refer to memory operands, which we would not need anymore.
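
To illustrate the point with a toy model (illustrative Ruby, not YJIT code; `insn_count` and its cost assumptions are hypothetical): on a load/store architecture like arm64, every memory operand costs an extra load, while x86_64 can fold one memory operand directly into most ALU instructions, so register allocation should remove more instructions on arm64.

```ruby
# Toy instruction-count model for a two-operand ALU op (e.g. cmp):
# x86_64 can use one memory operand directly; arm64 must load each
# memory operand into a register first.
def insn_count(arch, mem_operands)
  case arch
  when :x86_64
    # one ALU instruction, plus a load for each memory operand beyond the first
    1 + [mem_operands - 1, 0].max
  when :arm64
    # one load per memory operand, plus the ALU instruction itself
    1 + mem_operands
  end
end

insn_count(:x86_64, 2) # => 2 (one load + one reg/mem ALU op)
insn_count(:arm64, 2)  # => 3 (two loads + one reg/reg ALU op)
```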

Maybe worth doing a bit more investigation and looking at the resulting machine code. See if there are redundant moves and memory operations, or some kind of bug?

Review comment on yjit/src/backend/ir.rs (outdated, resolved)
k0kubun and others added 2 commits April 5, 2023 11:42
Co-authored-by: Maxime Chevalier-Boisvert <maximechevalierb@gmail.com>
Review comment on yjit/src/backend/ir.rs (outdated, resolved)
Co-authored-by: Maxime Chevalier-Boisvert <maximechevalierb@gmail.com>
@k0kubun (Member, Author) commented Apr 5, 2023

Maybe worth doing a bit more investigation and looking at the resulting machine code. See if there are redundant moves and memory operations, or some kind of bug?

When we're looking at stats on railsbench, I guess we have to account for the fact that the behavior of railsbench is not that deterministic compared to other smaller benchmarks. Here's the stats of a run with --yjit-temp-regs=0 on arm64:

inline_code_size:          4,372,572
outlined_code_size:        2,153,544
code_region_size:          7,798,784

and here's the stats of a run with --yjit-temp-regs=5 on arm64:

inline_code_size:          4,337,620
outlined_code_size:        3,063,208
code_region_size:          7,815,168

so inline_code_size became slightly smaller in this instance. In a small benchmark like fib, inline_code_size is always smaller (14,600 → 14,468).

I also looked at liquid-c, but it was as random as railsbench. --yjit-temp-regs=5 is sometimes worse, and sometimes significantly better.


On railsbench, I looked at some code with registers assigned to stack temps. This is an example scenario with smaller code:

--yjit-temp-regs=0

92 bytes

  # Block: <=>@/opt/rubies/ruby/lib/ruby/3.3.0+0/rubygems/version.rb:373 (ISEQ offset: 150, chain_depth: 1)
  # Insn: opt_eq (stack_size: 2)
  # guard arg0 fixnum
  0x107d3d86c: ldur x11, [x21]
  0x107d3d870: tst x11, #1
  0x107d3d874: b.eq #0x107d3f318
  0x107d3d878: nop
  0x107d3d87c: nop
  0x107d3d880: nop
  0x107d3d884: nop
  0x107d3d888: nop
  # guard arg1 fixnum
  0x107d3d88c: ldur x11, [x21, #8]
  0x107d3d890: tst x11, #1
  0x107d3d894: b.eq #0x107d3f334
  0x107d3d898: nop
  0x107d3d89c: nop
  0x107d3d8a0: nop
  0x107d3d8a4: nop
  0x107d3d8a8: nop
  0x107d3d8ac: ldur x11, [x21]
  0x107d3d8b0: ldur x12, [x21, #8]
  0x107d3d8b4: cmp x11, x12
  0x107d3d8b8: mov x11, #0x14
  0x107d3d8bc: mov x12, #0
  0x107d3d8c0: csel x11, x11, x12, eq
  0x107d3d8c4: stur x11, [x21]

--yjit-temp-regs=5

76 bytes

  # Block: <=>@/opt/rubies/ruby/lib/ruby/3.3.0+0/rubygems/version.rb:373 (ISEQ offset: 150, chain_depth: 1)
  # reg_temps: 00000000 -> 00000011
  # Insn: opt_eq (stack_size: 2)
  # guard arg0 fixnum
  0x107edc120: tst x1, #1
  0x107edc124: b.eq #0x107ede270
  0x107edc128: nop
  0x107edc12c: nop
  0x107edc130: nop
  0x107edc134: nop
  0x107edc138: nop
  # guard arg1 fixnum
  0x107edc13c: tst x9, #1
  0x107edc140: b.eq #0x107ede2ac
  0x107edc144: nop
  0x107edc148: nop
  0x107edc14c: nop
  0x107edc150: nop
  0x107edc154: nop
  0x107edc158: cmp x1, x9
  0x107edc15c: mov x11, #0x14
  0x107edc160: mov x12, #0
  0x107edc164: csel x11, x11, x12, eq
  0x107edc168: mov x1, x11
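
Since arm64 instructions are fixed-width, the byte counts above are just instruction counts times four: the regs=5 version drops the four ldur loads (two in the guards, two before the cmp), while the final stur simply becomes a mov of the same size.

```ruby
insn_bytes = 4          # fixed-width arm64 encoding
regs0 = 23 * insn_bytes # => 92 bytes (includes 4 ldur loads)
regs5 = 19 * insn_bytes # => 76 bytes (loads gone; stur became mov)
saved = regs0 - regs5   # => 16 bytes, i.e. 4 instructions
```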

On the other hand, here's an example that's generated only with --yjit-temp-regs=5 (8 bytes):

  # Module#===
  # spill_temps: 00000011 -> 00000000
  0x107edc1ec: stur x1, [x21]
  0x107edc1f0: stur x9, [x21, #8]

Savings in load instructions could be offset by spills, especially on railsbench, which has many method calls and C function calls. Here are other stats on railsbench:

temp_reg_opnd:               109,487
temp_mem_opnd:                84,168
temp_spill:                   71,843

The temp_reg_opnd / temp_spill ratio is 1.52. On fib, the ratio is 1.70. Even with fib's ratio, the savings in code size are not that significant. Given that temp_reg_opnd may not always translate to a code size reduction (e.g. a stur is just changed to a mov), it's not that surprising to me that the code size savings on railsbench on arm64 are minimal or randomly worse.
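
The quoted ratio checks out against the raw counters (a quick sanity computation, not YJIT code):

```ruby
# counters from the railsbench stats above
temp_reg_opnd = 109_487
temp_spill    =  71_843
ratio = (temp_reg_opnd.to_f / temp_spill).round(2)
# ratio => 1.52: only about 1.5 register uses per spill on railsbench
```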

@maximecb (Contributor) commented Apr 5, 2023

When we're looking at stats on railsbench, I guess we have to account for the fact that the behavior of railsbench is not that deterministic compared to other smaller benchmarks. Here's the stats of a run with --yjit-temp-regs=0 on arm64:

Hm. Is there nondeterministic behavior in railsbench? I tried to seed the random number generator.

I see that SecureRandom is being used here, which might not be seeded though?

benchmarks/railsbench/config/initializers/content_security_policy.rb:# Rails.application.config.content_security_policy_nonce_generator = -> request { SecureRandom.base64(16) }

Generally speaking, we should try to remove sources of nondeterminism in the benchmarks (by seeding RNGs and not using RNGs that can't be seeded), and we should also eliminate sources of nondeterminism from YJIT if possible.

Given the fact that temp_reg_opnd may not always translate to a code size reduction, e.g. stur is just changed to mov, it's not that surprising to me that the code size savings in railsbench on arm64 is minimal or randomly worse.

It's a bit disappointing when compared to x86. On arm it looks like we're just increasing complexity and compilation time for not much gain (so far).

Are the stur instructions mostly being used for writing the output operands of instructions? If so, I wonder if we could somehow do better by looking ahead and trying to place the output value into the output register directly when allocating instruction outputs.

Otherwise, do you have some ideas for making this better? What are the next steps, in your mind?

@k0kubun (Member, Author) commented Apr 5, 2023

Is there nondeterministic behavior in railsbench? I tried to seed the random number generator.

I know that you already worked on seeding the random number generator. I think my past change also made it more deterministic.

However, from my experience in debugging YJIT on railsbench, it never felt deterministic.

One theory is that railsbench may be using threads somewhere and their interruptions are time-sliced. I haven't investigated what could be using threads, but the method call counts in Shopify/yjit-bench#163 indicate that threads might be used and get interrupted (e.g. #<Class:Thread>#handle_interrupt).

Are the stur instructions mostly being used for writing the output operands of instructions? If so, I wonder if we could somehow do better by looking ahead and trying to place the output value into the output register directly when allocating instruction outputs.

At a glance, it does seem like the csel + mov (stur) combo could eliminate the mov instruction if the destination is not a memory operand. I can have a look at improving that in the backend, and keep looking for other such examples.

do you have some ideas for making this better? What are the next steps, in your mind?

While I've done a few x86_64 backend optimizations as a starter, I still haven't added backend optimizations specifically for stack temp register allocation. I was thinking about trying it next. Spending more time on arm64 code might let me discover something new too.

One idea was to avoid using a register when the register operand is used only once and then immediately spilled, hoping that we'll never have a block whose spill cost exceeds the speedup from a register.
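
That heuristic could be sketched roughly like this (illustrative Ruby, not the actual Rust backend; `worth_register?` and its input are hypothetical):

```ruby
# Hypothetical cost model: allocating a register saves one load per read,
# but costs one store when the temp is spilled. If a temp is read only
# once before the next spill point, the register is a wash at best.
def worth_register?(reads_before_spill)
  reads_before_spill > 1
end

worth_register?(1) # => false: the one saved load is paid back by the spill store
worth_register?(3) # => true: three saved loads outweigh one spill store
```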

In addition to instruction elimination discussed above, I also want to improve method calls. Currently, method calls are always spilling arguments before passing them, but I'd like to see if we can do it more efficiently.

Once I run out of ideas for stack temp register allocation, I plan to look at allocating registers for local variables. Most spills added for stack temps should be necessary/useful for local variables too, so the complexity added by this effort will not be wasted if it shows more improvements.

@maximecb (Contributor) commented Apr 6, 2023

While I've done a few x86_64 backend optimizations as a starter, I still haven't added backend optimizations specifically for stack temp register allocation. I was thinking about trying it next. Spending more time on arm64 code might let me discover something new too.

I support this plan 👍 🙂

You've outlined many good ideas for improving stack temp allocation. I think it makes sense to try to improve it as much as possible before moving on to locals. Ideally we can get to the point where we get clear speedups on arm as well.

Will merge this PR to make things easier.

@maximecb maximecb merged commit 89bdf6e into ruby:master Apr 6, 2023
97 checks passed
@maximecb maximecb deleted the yjit-stack-arm64 branch April 6, 2023 15:35
@k0kubun (Member, Author) commented Apr 7, 2023

On the branch #7671, --yjit-temp-regs=5 very stably produces a smaller inline_code_size than --yjit-temp-regs=0. I think stack temp register allocation had a positive impact on inline code size for arm64 as well, but the noise from railsbench's randomness was probably amplified by nop instructions.

@k0kubun (Member, Author) commented Apr 21, 2023

After recent arm64 optimizations (#7671, #7726, #7744, #7745, #7747), I don't see obvious inefficiency in code that we currently generate. The 1st itr performance was improved by #7748 too. I benchmarked the latest master on arm64 again.

The performance on micro and other benchmarks seems comparable to x86_64's (#7651). It's still less impressive on headline benchmarks, but the mail benchmark reliably shows a speedup, so it'd be useful at least for that.

Headline

regs=0: ruby 3.3.0dev (2023-04-20T23:09:16Z master 072ef7a1aa) +YJIT [arm64-darwin22]
regs=5: ruby 3.3.0dev (2023-04-20T23:09:16Z master 072ef7a1aa) +YJIT [arm64-darwin22]

--------------  -----------  ----------  -----------  ----------  --------------  -------------
bench           regs=0 (ms)  stddev (%)  regs=5 (ms)  stddev (%)  regs=5 1st itr  regs=0/regs=5
activerecord    17.3         3.6         17.2         2.5         0.99            1.01
erubi_rails     6.0          6.0         6.0          5.9         0.78            1.00
hexapdf         905.2        1.1         904.1        1.4         0.99            1.00
liquid-c        25.0         4.2         25.0         3.9         0.90            1.00
liquid-compile  27.1         2.6         27.2         2.5         0.89            1.00
liquid-render   49.3         3.0         49.0         3.1         0.98            1.01
mail            57.9         1.8         55.1         1.1         0.94            1.05
psych-load      900.9        0.4         887.5        0.8         0.97            1.02
railsbench      643.0        2.5         643.7        1.3         0.93            1.00
ruby-lsp        31.2         24.1        31.2         28.0        1.11            1.00
sequel          30.7         1.7         30.4         1.4         1.02            1.01
--------------  -----------  ----------  -----------  ----------  --------------  -------------

Other

regs=0: ruby 3.3.0dev (2023-04-20T23:09:16Z master 072ef7a1aa) +YJIT [arm64-darwin22]
regs=5: ruby 3.3.0dev (2023-04-20T23:09:16Z master 072ef7a1aa) +YJIT [arm64-darwin22]

-------------  -----------  ----------  -----------  ----------  --------------  -------------
bench          regs=0 (ms)  stddev (%)  regs=5 (ms)  stddev (%)  regs=5 1st itr  regs=0/regs=5
binarytrees    112.9        2.7         109.5        2.8         1.02            1.03
chunky_png     339.4        0.4         323.3        0.8         1.08            1.05
erubi          119.0        1.5         118.5        1.4         1.09            1.00
etanni         218.8        0.8         217.1        1.1         1.01            1.01
fannkuchredux  445.1        0.2         355.5        0.7         1.00            1.25
lee            478.3        1.0         469.5        0.9         1.44            1.02
nbody          41.4         0.5         40.8         0.7         1.00            1.01
optcarrot      1235.5       0.5         1088.7       0.4         1.20            1.13
ruby-json      1536.8       0.2         1533.2       0.3         1.02            1.00
rubykon        3325.8       0.4         3204.9       0.6         1.05            1.04
-------------  -----------  ----------  -----------  ----------  --------------  -------------

Micro

regs=0: ruby 3.3.0dev (2023-04-20T23:09:16Z master 072ef7a1aa) +YJIT [arm64-darwin22]
regs=5: ruby 3.3.0dev (2023-04-20T23:09:16Z master 072ef7a1aa) +YJIT [arm64-darwin22]

--------------  -----------  ----------  -----------  ----------  --------------  -------------
bench           regs=0 (ms)  stddev (%)  regs=5 (ms)  stddev (%)  regs=5 1st itr  regs=0/regs=5
30k_ifelse      312.3        1.5         311.4        1.2         0.75            1.00
30k_methods     763.6        1.1         756.8        1.4         0.97            1.01
cfunc_itself    21.0         0.8         19.7         0.9         1.06            1.06
fib             34.3         0.8         25.4         0.7         1.35            1.35
getivar         40.3         51.9        27.9         71.3        1.00            1.45
keyword_args    30.0         0.7         28.0         0.9         1.06            1.07
respond_to      19.6         0.8         15.3         0.9         1.28            1.28
setivar         10.5         78.0        8.2          90.5        1.00            1.28
setivar_object  28.7         48.2        27.4         50.7        0.99            1.05
setivar_young   28.8         48.1        27.5         50.1        0.99            1.05
str_concat      21.5         3.4         20.1         4.5         1.05            1.07
throw           11.9         1.1         11.7         1.2         1.01            1.01
--------------  -----------  ----------  -----------  ----------  --------------  -------------

@maximecb (Contributor)

You did a good job on this project. Well done 👍

Surprised it made so much difference on x86 but so little on arm64. Maybe it's because Apple's arm64 chips have memory renaming like newer AMD chips (they can cache memory locations in registers).

One last thing that you could try, if you want to, is to give arm64 more registers, like up to 7 or 8. Should be easier on arm64 since the platform has access to more registers?

@k0kubun (Member, Author) commented Apr 21, 2023

Thank you 👍

One last thing that you could try, if you want to, is to give arm64 more registers, like up to 7 or 8.

I tried giving it 8 registers (Shopify@f70828a). It doesn't seem to make a significant difference. I guess spills at C calls and method calls happen too often.

regs=0: ruby 3.3.0dev (2023-04-21T16:19:21Z yjit-regs-8 f70828a3c5) +YJIT [arm64-darwin22]
regs=8: ruby 3.3.0dev (2023-04-21T16:19:21Z yjit-regs-8 f70828a3c5) +YJIT [arm64-darwin22]

--------------  -----------  ----------  -----------  ----------  --------------  -------------
bench           regs=0 (ms)  stddev (%)  regs=8 (ms)  stddev (%)  regs=8 1st itr  regs=0/regs=8
activerecord    17.1         2.5         16.8         2.0         0.93            1.02
erubi_rails     6.1          5.8         6.1          6.3         0.80            1.00
hexapdf         906.3        2.1         892.7        1.1         0.97            1.02
liquid-c        24.7         3.6         24.7         3.5         0.89            1.00
liquid-compile  27.1         2.2         27.1         1.8         0.90            1.00
liquid-render   48.7         3.2         48.4         3.3         0.92            1.01
mail            57.0         1.4         55.0         1.4         0.92            1.04
psych-load      897.4        0.3         890.6        0.5         1.02            1.01
railsbench      641.7        1.6         645.8        1.9         0.88            0.99
ruby-lsp        31.1         23.8        31.3         26.7        0.93            0.99
sequel          30.1         1.4         30.1         1.2         0.99            1.00
--------------  -----------  ----------  -----------  ----------  --------------  -------------

@maximecb (Contributor)

Would it make sense to try to focus on avoiding spills for some simple cases then? For example, C function calls that we know will not allocate or raise?

@k0kubun (Member, Author) commented Apr 21, 2023

I counted asm.spill_temps() call reasons again (Shopify@b266402):

spill reasons:
       ccall_alloc:      4,397 (39.8%)
       method_ruby:      3,002 (27.2%)
          method_c:      2,647 (24.0%)
    ccall_no_alloc:        999 ( 9.0%)

C function calls that don't allocate or raise (ccall_no_alloc) are only 9%, and we'd have to use callee-saved registers for them, which could incur another overhead on YJIT entry and leave. We could try, but I wouldn't expect a dramatic impact from it.
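
The percentages above match the raw counts; as a quick check (plain arithmetic, not YJIT code), the total is the sum of the four categories:

```ruby
# spill-reason counts from the stats above
spill_reasons = {
  ccall_alloc:    4_397,
  method_ruby:    3_002,
  method_c:       2_647,
  ccall_no_alloc:   999,
}
total = spill_reasons.values.sum # => 11_045
pct = spill_reasons.transform_values { |n| (100.0 * n / total).round(1) }
# pct => {ccall_alloc: 39.8, method_ruby: 27.2, method_c: 24.0, ccall_no_alloc: 9.0}
```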

@maximecb (Contributor)

Yeah. Hmm, I guess I wonder if all the C calls are properly classified as allocating or not allocating (are we being conservative?) 🤔 And then of course, allocations are only problematic if they trigger GC, which can happen but is a relatively rare event. This goes back to lazy frame pushing ideas and such. Eventually we will probably have to go there, but maybe there are other low-hanging fruit we can look at in the meantime.

@k0kubun (Member, Author) commented Apr 21, 2023

I used four categories above, but all the C function-related ones would need a similar solution. So there are really just two: Ruby methods and C function calls.

For Ruby methods, it'll be about sharing a set of registers across multiple frames. If a caller inlines a callee, the callee can spill registers for the caller, so this spill can be avoided. Whether you skip frame push/pop or not, inlining seems like the only way around it, because each piece of spill code needs to be RegTemps-specific.

For C function calls, if you use caller-saved registers, you have to spill them whether the call allocates/raises or not. So the other options are callee-saved registers and immediates. I think it's unfortunately impossible to lazily spill callee-saved registers to the VM stack, because we don't know whether a C function spills them to the native stack or just doesn't use them. For immediates, you don't need to spill them for GC marking (not sure if leaving an uninitialized/previous stack slot is 100% safe for GC or ObjectSpace, though), but you still have to lazily spill them when a catch table interprets the frame.

For the above reasons, I feel like optimizing immediates is the only potential low-hanging fruit for optimizing stack temps further. At least we won't have to read them from memory even if they're already spilled. In addition, I guess it's worth trying to skip spilling immediates when the ISEQ doesn't catch exceptions.

@k0kubun (Member, Author) commented Apr 21, 2023

I think it's unfortunately impossible to lazily spill callee-saved registers to the VM stack, because we don't know whether a C function spills them to the native stack or just doesn't use them.

Actually, I thought this might be fine, because spilling a callee-saved register to the VM stack will always leave the actual value on the VM stack and/or the machine stack, both of which GC will mark. Because ObjectSpace may return it, it still seems risky to write a random value used by a C function to the VM stack, though.

However, lazily spilling registers from a C function is not a low-hanging fruit anyway.
