
YJIT: Optimize cmp REG, 0 into test REG, REG #7471

Merged
merged 1 commit into from Mar 9, 2023

Conversation

k0kubun
Member

@k0kubun k0kubun commented Mar 8, 2023

This implements the idea suggested at #7242 (comment). When checking equality of a register against 0, using the test instruction generates more compact code.

This is particularly useful when you want to do something like asm.cmp(block_handler, VM_BLOCK_HANDLER_NONE.into()), where you would otherwise need to assert VM_BLOCK_HANDLER_NONE == 0 in order to use asm.test.
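The rewrite can be pictured as a tiny pattern match over backend instructions. A minimal sketch with hypothetical Insn and Opnd types (much simpler than YJIT's actual IR):

```rust
// Hypothetical, simplified model of backend instructions; YJIT's real
// Insn/Opnd types have many more variants and fields.
#[derive(Clone, Copy, Debug, PartialEq)]
enum Opnd {
    Reg(u8),  // a machine register, identified by number
    Imm(i64), // an immediate value
}

#[derive(Clone, Copy, Debug, PartialEq)]
enum Insn {
    Cmp(Opnd, Opnd),
    Test(Opnd, Opnd),
}

// Peephole rewrite: `cmp reg, 0` sets the zero flag the same way as
// `test reg, reg`, but the latter has a shorter encoding.
fn optimize_cmp_zero(insn: Insn) -> Insn {
    match insn {
        Insn::Cmp(reg @ Opnd::Reg(_), Opnd::Imm(0)) => Insn::Test(reg, reg),
        other => other,
    }
}
```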

Generated code

Before

  # block_given?
  ...
  0x5583d243b07b: cmp rax, 0
  0x5583d243b07f: mov eax, 0x14
  0x5583d243b084: mov ecx, 0
  0x5583d243b089: cmove rax, rcx

After

  # block_given?
  ...
  0x5567d51cb07b: test rax, rax
  0x5567d51cb07e: mov eax, 0x14
  0x5567d51cb083: mov ecx, 0
  0x5567d51cb088: cmove rax, rcx
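The size win is visible directly in the addresses above: cmp rax, 0 occupies 4 bytes (REX.W prefix, opcode, ModRM, imm8), while test rax, rax needs only 3 because it carries no immediate. A trivial check of the address deltas from the two listings:

```rust
// Instruction sizes recovered from the address deltas in the disassembly
// listings above: the next instruction's address minus this one's.
fn insn_size(start: u64, next: u64) -> u64 {
    next - start
}
```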

Code size stats

On railsbench, inline code size went down by 0.1%.

Before

inline_code_size:          2,160,054
outlined_code_size:        2,158,816

After

inline_code_size:          2,157,565
outlined_code_size:        2,156,784

@matzbot matzbot requested a review from a team March 8, 2023 00:23
@maximecb
Contributor

maximecb commented Mar 8, 2023

It's nice to be able to do this optimization, but if we keep adding passes, there is a risk that compilation time could become a concern.

One way that we could evaluate compilation time is to measure the time taken by 30k_ifelse on the first iteration with --yjit-call-threshold=1 and compare that against the time taken by the CRuby interpreter with YJIT disabled (compute the difference). This should be evaluated on Linux because AFAIK on macOS, the mprotect calls are very slow. The overhead may turn out to be insignificant, but it's still something we should try to quantify.

Otherwise, I would be tempted to ask, can we do this in the x86_merge pass you added the other day? Typically, peephole optimizations like this are done at the end, because the compiler may introduce inefficiencies during compilation.

@k0kubun
Member Author

k0kubun commented Mar 8, 2023

I'll benchmark the 1st itr of 30k_ifelse on Linux and report it here.

Otherwise, I would be tempted to ask, can we do this in the x86_merge pass you added the other day? Typically, peephole optimizations like this are done at the end, because the compiler may introduce inefficiencies during compilation.

What I wanted to explain in the code comments is that x86_split inserts Load instructions in between Cmp and Cmove, for example, so peeking at just one preceding instruction isn't enough to apply this optimization in x86_merge. That's why I placed x86_replace before x86_split. On the other hand, x86_split rewrites all output operands to Insn::Out, so x86_merge can't be moved before x86_split either, because an Opnd::Reg could unexpectedly be overwritten to Opnd::Out.
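The ordering constraint described above can be sketched abstractly: once splitting inserts a Load between Cmp and Cmove, a peephole that only looks one instruction back no longer sees the Cmp. A hypothetical illustration (these enum variants are placeholders, not YJIT's real IR):

```rust
// Hypothetical instruction kinds, simplified to the shape of the problem.
#[derive(Debug, PartialEq)]
enum Insn {
    Cmp,
    Load,
    Cmove,
}

// A peephole that only inspects the immediately preceding instruction.
fn prev_is_cmp(stream: &[Insn], idx: usize) -> bool {
    idx > 0 && stream[idx - 1] == Insn::Cmp
}
```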

@k0kubun
Member Author

k0kubun commented Mar 8, 2023

Looks like the 1st itr of 30k_ifelse reflects compilation speed pretty well. The slowdown from x86_replace is as follows, and x86_merge seems to add the same kind of overhead, so it seems wise to keep the number of passes as low as possible.

before: ruby 3.3.0dev (2023-03-07T22:59:37Z master 3e731cd945) +YJIT [x86_64-linux]
after: ruby 3.3.0dev (2023-03-08T00:22:49Z yjit-backend-cmp 3d9ff80b51) +YJIT [x86_64-linux]

----------  -----------  ----------  ----------  ----------  ------------  -------------
bench       before (ms)  stddev (%)  after (ms)  stddev (%)  before/after  after 1st itr
30k_ifelse  904.6        0.0         961.2       0.0         0.94          0.94
----------  -----------  ----------  ----------  ----------  ------------  -------------

I'll think about removing at least one pass here. At the very least, maybe we could do this as part of x86_split.

@maximecb
Contributor

maximecb commented Mar 8, 2023

If we could do it in x86_split that would be attractive 👍

Another potential alternative is to do it right when the instruction is inserted, by peeking at the previous instruction.
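The insertion-time alternative can be sketched as rewriting at push time, before any later pass can separate the pattern. The Assembler type below is hypothetical, not YJIT's actual API:

```rust
#[derive(Clone, Copy, Debug, PartialEq)]
enum Opnd {
    Reg(u8),
    Imm(i64),
}

#[derive(Clone, Copy, Debug, PartialEq)]
enum Insn {
    Cmp(Opnd, Opnd),
    Test(Opnd, Opnd),
}

// Hypothetical assembler that applies the rewrite as each instruction
// is appended, so no separate pass over the stream is needed.
struct Assembler {
    insns: Vec<Insn>,
}

impl Assembler {
    fn new() -> Self {
        Assembler { insns: Vec::new() }
    }

    fn push(&mut self, insn: Insn) {
        let insn = match insn {
            // `cmp reg, 0` becomes `test reg, reg` on the way in.
            Insn::Cmp(reg @ Opnd::Reg(_), Opnd::Imm(0)) => Insn::Test(reg, reg),
            other => other,
        };
        self.insns.push(insn);
    }
}
```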

@k0kubun
Member Author

k0kubun commented Mar 9, 2023

I did that in x86_split in 5b0a26c, and it seems like the idea worked 🙂

before: ruby 3.3.0dev (2023-03-08T21:13:23Z master 309dd50a01) +YJIT [x86_64-linux]
after: ruby 3.3.0dev (2023-03-09T00:48:09Z yjit-backend-cmp 5b0a26c393) +YJIT [x86_64-linux]

----------  -----------  ----------  ----------  ----------  ------------  -------------
bench       before (ms)  stddev (%)  after (ms)  stddev (%)  before/after  after 1st itr
30k_ifelse  912.4        0.0         917.4       0.0         0.99          0.99
----------  -----------  ----------  ----------  ----------  ------------  -------------

@maximecb
Contributor

maximecb commented Mar 9, 2023

Looks good! Thanks for taking the time to validate the effect on compilation time :)

@maximecb maximecb merged commit 22d8e95 into ruby:master Mar 9, 2023
97 checks passed
@k0kubun k0kubun deleted the yjit-backend-cmp branch March 9, 2023 15:43