YJIT: Enhance the `String#<<` method substitution to handle integer codepoint values. #11032

nirvdrum · 2024-06-20T20:45:35Z

This PR extends YJIT's method substitution for String#<< to handle integer codepoints as well. If the string is ASCII-8BIT and the codepoint is a byte value, YJIT will dispatch to rb_str_buf_cat_byte as a fast path for working with binary strings. Otherwise, it'll dispatch to the general rb_str_concat just as vm_opt_ltlt would.
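For readers less familiar with the Ruby-level behaviour being optimized, here is a minimal sketch of what `String#<<` does with an integer argument. It uses only core `String` methods. The helper name `append_all` is made up for illustration, and the sketch says nothing about which internal C function ends up handling each call.

```ruby
# Hypothetical helper (not part of the PR): append each integer codepoint with String#<<.
def append_all(str, codepoints)
  codepoints.each { |cp| str << cp }
  str
end

# Binary (ASCII-8BIT) receiver: each integer in 0x00..0xff is appended as one raw byte.
binary = append_all(String.new(encoding: Encoding::BINARY), [0x01, 0xff])
p binary.bytes                          # => [1, 255]
p binary.encoding == Encoding::BINARY   # => true

# UTF-8 receiver: the integer is treated as a codepoint and encoded (U+00E9 takes two bytes).
utf8 = append_all(String.new("caf"), [0xe9])
p utf8 == "caf\u00e9"                   # => true
p utf8.bytesize                         # => 5

# A value outside the byte range cannot be appended to a binary string.
begin
  String.new(encoding: Encoding::BINARY) << 0x100
rescue RangeError => e
  puts "RangeError: #{e.message}"
end
```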

rb_str_buf_cat_byte currently works with both ASCII-8BIT and US-ASCII, but the YJIT side only optimizes for ASCII-8BIT. It could be extended easily enough with an additional comparison. The encoding indices for a handful of encodings, including both ASCII-8BIT and US-ASCII, are fixed and sequential, so we could also do a range check. For the time being, I've omitted the handling of US-ASCII. I'd like to get feedback on the simplified PR and extend it with US-ASCII handling if needed (I'm not yet convinced it is).

Please advise if the mechanism I'm using to handle polymorphic dispatch is incorrect. We already have jit_rb_str_concat as a method substitution for String#<<, but it deliberately only handled string arguments. I could have merged the two, but that struck me as being complicated. However, I also don't know if it's fine to call between methods like this. And, it ends up duplicating some of the type checks to ensure we dispatch to the correct type depending on what we see at compile time. I was unsure on how to handle the runtime guards, so please pay extra attention to that.

maximecb · 2024-06-21T15:25:56Z

yjit/src/codegen.rs

+    // In order to use the fast path (rb_str_buf_cat_byte), the string encoding must be ASCII-8BIT
+    // and the codepoint must be in the byte range (0x00 - 0xff).
+    // If either of those conditions are not met we must use the general string concat (str_buf_cat)
+    // function with the original codepoint argument.


As a general comment, I like to avoid generating branches inside the inline block because it can be taxing for branch prediction. This is another discussion, but I think it's useful to think of branch prediction as a finite resource. It works well when you don't have a lot of branches, but the code we generate is going to have (tens of) thousands of them, so the CPU can't always remember which direction a given branch went the last time.

Branches inside the inline block also tend to result in bigger inline code size (worse for the instruction cache). I like to make it so that failing guards side-exit to the interpreter (outlined code) as much as possible. One thing you could potentially do is peek at the encoding at compilation time... This would involve speculating that the string being appended to will always have the same encoding. We can measure how often this is the case in practice.

We can continue this discussion in #yjit-internal though. Still open to merging this PR as is.
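To make the "peek at the encoding at compilation time" idea concrete, here is a toy Ruby-level model. It assumes nothing about YJIT's real API: `compile_append_site` and the `stats` hash are hypothetical names. The point is only the shape of the technique: observe the receiver once, specialize on what was seen, keep a single guard on the hot path, and count how often the speculation holds.

```ruby
# Toy model only: none of these names exist in YJIT.
# "Compiling" a call site means peeking at one sample receiver and speculating
# that later receivers will have the same encoding.
def compile_append_site(sample_receiver)
  speculated = sample_receiver.encoding # observed once, at "compile time"
  stats = { fast: 0, side_exit: 0 }

  site = lambda do |str, codepoint|
    if str.encoding == speculated
      stats[:fast] += 1       # guard passed: stay in the specialized code
    else
      stats[:side_exit] += 1  # guard failed: real code would leave to the interpreter here
    end
    str << codepoint          # the observable result is the same either way
  end

  [site, stats]
end

site, stats = compile_append_site(String.new(encoding: Encoding::BINARY))
buf = String.new(encoding: Encoding::BINARY)
site.call(buf, 0x0a)
site.call(buf, 0xff)
site.call(String.new("text"), 0x21) # UTF-8 receiver breaks the speculation
p stats                             # => {fast: 2, side_exit: 1} (older Rubies print {:fast=>2, :side_exit=>1})
```

Counters like these are one way to answer the "how often is this the case in practice" question before committing to the speculation.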

maximecb · 2024-06-21T15:32:25Z

Hi Kevin.

Commenting to give you as much feedback as possible.

> This PR extends YJIT's method substitution for String#<< to handle integer codepoints as well. If the string is ASCII-8BIT and the codepoint is a byte value, YJIT will dispatch to rb_str_buf_cat_byte as a fast path for working with binary strings.

I think that the idea of this PR is a good one. It seems like a sensible specialization to implement.

I wrote some comments on the code itself.

Two things I would like to see in a PR like this:

- Benchmark results to know how it affects performance on protoboeuf and our headline benchmarks (if any impact).
- Maybe a dump of the machine code generated when concatenating a few bytes in a loop, just to see if we're doing anything obviously inefficient.

> Please advise if the mechanism I'm using to handle polymorphic dispatch is incorrect. We already have jit_rb_str_concat as a method substitution for String#<<, but it deliberately only handled string arguments. I could have merged the two, but that struck me as being complicated. However, I also don't know if it's fine to call between methods like this. And, it ends up duplicating some of the type checks to ensure we dispatch to the correct type depending on what we see at compile time. I was unsure on how to handle the runtime guards, so please pay extra attention to that.

Seems sensible. Duplicating checks that occur at run time would not be good, but the checks you duplicated occur at compilation time. It makes sense to write the logic for this in a separate function to avoid ending up with a megafunction that is hard to follow.

tenderlove · 2024-07-09T22:04:09Z

Before this commit Protoboeuf+YJIT is about 5.28x slower than Google's protobuf implementation, but after this commit it is 4.33x slower.

Before:

ruby 3.4.0dev (2024-07-09T17:22:29Z master 6f6aff56b1) +YJIT [arm64-darwin23]
Warming up --------------------------------------
     encode upstream    14.000 i/100ms
   encode protoboeuf     2.000 i/100ms
Calculating -------------------------------------
     encode upstream    140.854 (± 0.7%) i/s -    714.000 in   5.069352s
   encode protoboeuf     26.672 (± 3.7%) i/s -    134.000 in   5.028086s

Comparison:
     encode upstream:      140.9 i/s
   encode protoboeuf:       26.7 i/s - 5.28x  slower

After:

ruby 3.4.0dev (2024-07-09T19:35:29Z yjit-optimize-stri.. 00cc8e4429) +YJIT [arm64-darwin23]
Warming up --------------------------------------
     encode upstream    13.000 i/100ms
   encode protoboeuf     3.000 i/100ms
Calculating -------------------------------------
     encode upstream    137.078 (± 0.7%) i/s -    689.000 in   5.026818s
   encode protoboeuf     31.678 (± 3.2%) i/s -    159.000 in   5.023590s

Comparison:
     encode upstream:      137.1 i/s
   encode protoboeuf:       31.7 i/s - 4.33x  slower
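
As a quick sanity check on the figures above, the slowdown ratios can be recomputed from the reported iterations per second. This is only arithmetic on the numbers in the output; the method name `slowdown` is illustrative.

```ruby
# Ratio of reference throughput to candidate throughput, rounded to two decimals.
def slowdown(reference_ips, candidate_ips)
  (reference_ips / candidate_ips).round(2)
end

puts slowdown(140.854, 26.672)  # before: 5.28
puts slowdown(137.078, 31.678)  # after:  4.33

# Protoboeuf's own throughput, after relative to before.
puts (31.678 / 26.672).round(2) # 1.19
```

Note that the upstream number also moved between the two runs (140.9 vs 137.1 i/s), so the change in the "x slower" ratio mixes both effects. Looking at protoboeuf alone, throughput went from 26.672 to 31.678 i/s, roughly 19% higher, with reported error margins of ±3.7% and ±3.2%.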

XrXr · 2024-07-09

An alternative to what you have is to make a new Rust/C function that essentially does the logic that you currently inline into every site (if codepoint.is_fixnum() { rb_str_buf_cat_byte } else { rb_str_concat }). That way you only generate one ccall at each site.

It should be easier to read than checking in assembler, perf should be about the same, and it's a code size win.

If you write it in C, it could go into string.c and you can additionally avoid removing static from rb_str_buf_cat_byte().

maximecb · 2024-07-09T22:44:38Z

I like Alan's idea of embedding some of the checks in a C function to save on code size and maybe make the code a bit more readable.

Also thanks Kevin for persisting. You picked a hard problem for your first big PR! 😉

nirvdrum · 2024-07-10T15:55:02Z

> An alternative to what you have is to make a new Rust/C function that essentially does the logic that you currently inline into every site (if codepoint.is_fixnum() { rb_str_buf_cat_byte } else { rb_str_concat }). That way you only generate one ccall at each site.

Okay. I think you were simplifying, but for completeness the check is really (in pseudo-code) if codepoint.is_fixnum() && receiver.is_binary_string() && codepoint.is_byte(). rb_str_buf_cat_byte does not have that logic. The caller is expected to call it only when those conditions hold (actually, rb_str_buf_cat_byte also works on US-ASCII strings, but I omitted that in YJIT), and there are assertions at the start of rb_str_buf_cat_byte that check that those conditions hold.

> If you write it in C, it could go into string.c and you can additionally avoid removing static from rb_str_buf_cat_byte().

I'll have to introduce a new function. I don't think updating rb_str_buf_cat_byte to do those checks is the right way to go. It would mean duplicating checks in the interpreter to derive information we already have.
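
The three-part condition from the pseudo-code above can be modelled at the Ruby level as follows. This is a sketch of the decision only, not the C helper from the PR: `fast_path_eligible?` and `concat_codepoint_model` are hypothetical names, and both branches simply call `String#<<`, since the two paths are meant to produce the same result.

```ruby
# Hypothetical Ruby-level model of the helper's decision; not the actual C function.
# The fast path applies only when all three conditions hold.
def fast_path_eligible?(receiver, arg)
  arg.is_a?(Integer) &&
    receiver.encoding == Encoding::BINARY &&
    arg.between?(0x00, 0xff)
end

# Appends arg to receiver and reports which path the real helper would be expected to take.
def concat_codepoint_model(receiver, arg)
  path = fast_path_eligible?(receiver, arg) ? :byte_fast_path : :general_concat
  receiver << arg
  path
end

buf = String.new(encoding: Encoding::BINARY)
p concat_codepoint_model(buf, 0xff)               # => :byte_fast_path
p concat_codepoint_model(buf, "x")                # => :general_concat (not an Integer)
p concat_codepoint_model(String.new("a"), 0x62)   # => :general_concat (receiver is UTF-8)
p buf.bytes                                       # => [255, 120]
```

Keeping the decision in one function means each call site only needs a single call, which is the code-size argument made earlier in the thread.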

XrXr · 2024-07-10T16:57:37Z

Yes. You should make it clear that the new function is a YJIT helper and, by keeping it out of the headers, signal that it has many preconditions that are hard to meet and is not for general use.

maximecb · 2024-07-24T20:49:26Z

yjit/src/codegen.rs

+    // Ensure the codepoint argument is a Fixnum.
+    let arg = asm.stack_opnd(0);
+    let comptime_arg = jit.peek_at_stack(&asm.ctx, 0);
+    if comptime_arg.fixnum_p() {
+        jit_guard_known_klass(
+            jit,
+            asm,
+            comptime_arg.class_of(),
+            arg,
+            arg.into(),
+            comptime_arg,
+            SEND_MAX_DEPTH,
+            Counter::guard_send_not_fixnums,
+        );
+    } else {
+        return false;
+    }


Here you already checked that comptime_arg is a fixnum in the caller.

Maybe we can also fold the guard that the value is fixnum into rb_yjit_str_concat_codepoint? Because presumably the fallback rb_str_concat(str, codepoint) can handle any kind of input type? @XrXr would this be valid?

You could do if (RB_LIKELY(value is fixnum) && ENCODING_GET_INLINED(str) == rb_ascii8bit_encindex()) {...}

XrXr · 2024-07-25

Yes, rb_str_concat() handles everything. And I agree with folding the guard into the C function.
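
Seen from Ruby, "handles everything" means `String#<<` accepts both strings and integers and rejects other types itself, so a caller does not need to pre-filter the argument. A minimal sketch, where `append_accepts?` is an illustrative helper rather than anything from the PR:

```ruby
# Returns true when `str << arg` succeeds, false when Ruby rejects the argument's type.
def append_accepts?(str, arg)
  str << arg
  true
rescue TypeError
  false
end

buf = String.new(encoding: Encoding::BINARY)
p append_accepts?(buf, "ab")   # => true  (String argument)
p append_accepts?(buf, 0x63)   # => true  (Integer argument)
p append_accepts?(buf, :sym)   # => false (TypeError: no implicit conversion)
p buf.bytes                    # => [97, 98, 99]
```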

nirvdrum · 2024-07-25

The original version with the assembler required the type to be a Fixnum, but that's no longer the case. I can move this guard easily enough. I just want to confirm that if we see another type at the call site we don't want to try to cut over to jit_rb_str_concat?

nirvdrum · 2024-07-25

If we're going to allow non-codepoint arguments into rb_yjit_str_concat_codepoint, should I rename that function? Or is naming it after its primary intention okay?

XrXr · 2024-07-25

> I just want to confirm that if we see another type at the call site we don't want to try to cut over to jit_rb_str_concat?

It seems like it's already behind a FIXNUM_P check in the caller (checking in a different way), so you could turn this into an assert.

Renaming the C function sounds good. Maybe rb_yjit_str_concat_likely_byte? Also a short comment on the C side would be nice.

nirvdrum · 2024-07-25

I was looking at this with @paracycle, too. Since we're talking about turning rb_yjit_str_concat_codepoint into a generic fallback, should we collapse the jit_rb_str_concat and jit_rb_str_concat_codepoint codegen methods into a single handler that calls this C function?

I don't know the history of jit_rb_str_concat well enough to say whether it's doing more in assembly for speed or out of necessity. That's what I used as the template for my original implementation of jit_rb_str_concat_codepoint, and we decided to move things out to C to keep the inline code block smaller and reduce branching. If we applied those same principles here, I suppose we could simplify the codegen a fair bit. But if that's out of scope, that's fine with me.

k0kubun · 2024-07-25

Collapsing them into a single C call sounds good to me. Most of today's jit_rb_str_concat seems like something that could be done in the C function.

nirvdrum force-pushed the yjit-optimize-string-append-byte branch 2 times, most recently from 42ff8d8 to f118dd1 Compare June 21, 2024 18:04

nirvdrum force-pushed the yjit-optimize-string-append-byte branch 2 times, most recently from 477f936 to 00cc8e4 Compare July 9, 2024 19:35

nirvdrum force-pushed the yjit-optimize-string-append-byte branch from 00cc8e4 to 869d076 Compare July 10, 2024 15:43

nirvdrum marked this pull request as ready for review July 10, 2024 15:43

matzbot requested a review from a team July 10, 2024 15:43

nirvdrum force-pushed the yjit-optimize-string-append-byte branch 4 times, most recently from c89be19 to edc7703 Compare July 24, 2024 20:36

nirvdrum force-pushed the yjit-optimize-string-append-byte branch 5 times, most recently from a350c88 to 4026c6d Compare July 25, 2024 17:12

nirvdrum added 3 commits July 25, 2024 14:00

0b564b2 YJIT: Enhance the String#<< method substitution to handle integer codepoint values.

899de3a Document why we need to explicitly spill registers.

e6e462c Simplify passing byte to str_buf_cat.

nirvdrum force-pushed the yjit-optimize-string-append-byte branch from 4026c6d to e6e462c Compare July 25, 2024 18:00
