Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Remove __bp__ and speed-up bmethod calls #8060

Merged
merged 1 commit into from Jul 17, 2023
Merged

Remove __bp__ and speed-up bmethod calls #8060

merged 1 commit into from Jul 17, 2023

Conversation

XrXr
Copy link
Member

@XrXr XrXr commented Jul 12, 2023

This commit removes the __bp__ field from rb_control_frame_t. It was
introduced to help MJIT, but since MJIT was replaced by RJIT, we can use
vm_base_ptr() to compute it from the SP of the previous control frame
instead. Removing the field avoids needing to set it up when pushing new
frames.

Simply removing __bp__ would cause crashes since RJIT and YJIT used a
slightly different stack layout for bmethod calls than the interpreter.
At the moment of the call, the two layouts looked as follows:

               ┌────────────┐    ┌────────────┐
               │ frame_base │    │ frame_base │
               ├────────────┤    ├────────────┤
               │    ...     │    │    ...     │
               ├────────────┤    ├────────────┤
               │    args    │    │    args    │
               ├────────────┤    └────────────┘<─prev_frame_sp
               │  receiver  │
prev_frame_sp─>└────────────┘
                 RJIT & YJIT      interpreter

Essentially, vm_base_ptr() needs to compute the address to frame_base
given prev_frame_sp in the diagrams. The presence of the receiver
created an off-by-one situation.

Make the interpreter use the layout the JITs use for iseq-to-iseq
bmethod calls. Doing so removes unnecessary argument shifting and
vm_exec_core() re-entry from the interpreter, yielding a speed
improvement visible through benchmark/vm_defined_method.yml:

 patched:   7578743.1 i/s
  master:   4796596.3 i/s - 1.58x  slower

C-to-iseq bmethod calls now store one more VALUE than before, but that
should have negligible impact on overall performance.

Note that re-entering vm_exec_core() used to be necessary for firing
TracePoint events, but that's no longer the case since
9121e57.

Closes #6428

@matzbot matzbot requested a review from a team July 12, 2023 21:49
vm_core.h Show resolved Hide resolved
Copy link
Member

@k0kubun k0kubun left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Oh wow, this is amazing. I didn't expect you would change the interpreter as well, but you already had a patch 🙂 It's much faster too. Great job!

@maximecb
Copy link
Contributor

A wrinkle to this is that RJIT and YJIT use a slightly different stack layout for bmethod calls than the interpreter.

Just wondering if having the interpreter side conform to the JIT layout with the receiver on the stack could help keep things simpler, if the performance cost is not too high? AFAIK the receiver is always on the stack for every other type of call?

@k0kubun
Copy link
Member

k0kubun commented Jul 13, 2023

Just wondering if having the interpreter side conform to the JIT layout with the receiver on the stack could help keep things simpler, if the performance cost is not too high?

I believe that's what the second commit does and it speeds up the interpreter.

@maximecb
Copy link
Contributor

Alan did write:

C-to-ISeq bmethod calls continue to use a different layout (See invoke_iseq_block_from_c())

Just wondering if we could make it so everyone uses the same layout (same as the JITs), to avoid the complexity of managing different layouts with special flags, etc.

@k0kubun
Copy link
Member

k0kubun commented Jul 13, 2023

My bad, thanks for pointing that out.

@XrXr
Copy link
Member Author

XrXr commented Jul 14, 2023

AFAIK the receiver is always on the stack for every other type of call?

Not quite. Blocks don't get their self from the call site like method calls, they typically inherit the self of the scope that encloses the block. They can also get self from instance_eval and friends.

Just wondering if having the interpreter side conform to the JIT layout with the receiver on the stack could help keep things simpler, if the performance cost is not too high?

It's a close call between adding an 8 byte dead store and having the flag. I went with the flag because it's optimal speed wise, but I agree not having the flag is maybe a better trade off. I will try the no flag approach today. (PS. I'd grouped these two commits together because of these two options)

@maximecb
Copy link
Contributor

I will try the no flag approach today. (PS. I'd grouped these two commits together because of these two options)

Thanks for being willing to try it out. I know it's not the most fun thing to refactor things around, but it might make the code cleaner.

This commit removes the __bp__ field from rb_control_frame_t. It was
introduced to help MJIT, but since MJIT was replaced by RJIT, we can use
vm_base_ptr() to compute it from the SP of the previous control frame
instead. Removing the field avoids needing to set it up when pushing new
frames.

Simply removing __bp__ would cause crashes since RJIT and YJIT used a
slightly different stack layout for bmethod calls than the interpreter.
At the moment of the call, the two layouts looked as follows:

                   ┌────────────┐    ┌────────────┐
                   │ frame_base │    │ frame_base │
                   ├────────────┤    ├────────────┤
                   │    ...     │    │    ...     │
                   ├────────────┤    ├────────────┤
                   │    args    │    │    args    │
                   ├────────────┤    └────────────┘<─prev_frame_sp
                   │  receiver  │
    prev_frame_sp─>└────────────┘
                     RJIT & YJIT      interpreter

Essentially, vm_base_ptr() needs to compute the address to frame_base
given prev_frame_sp in the diagrams. The presence of the receiver
created an off-by-one situation.

Make the interpreter use the layout the JITs use for iseq-to-iseq
bmethod calls. Doing so removes unnecessary argument shifting and
vm_exec_core() re-entry from the interpreter, yielding a speed
improvement visible through `benchmark/vm_defined_method.yml`:

     patched:   7578743.1 i/s
      master:   4796596.3 i/s - 1.58x  slower

C-to-iseq bmethod calls now store one more VALUE than before, but that
should have negligible impact on overall performance.

Note that re-entering vm_exec_core() used to be necessary for firing
TracePoint events, but that's no longer the case since
9121e57.

Closes ruby#6428
@maximecb maximecb merged commit f302e72 into ruby:master Jul 17, 2023
91 checks passed
@XrXr XrXr deleted the rm-bp branch July 17, 2023 18:37
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
3 participants