New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Embed struct rmatch into GC slot #8097
Conversation
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The implementation looks good. Can you also run some of the yjit-bench headline benchmarks and write a few microbenchmarks to measure the performance and memory?
include/ruby/internal/core/rmatch.h
Outdated
typedef struct rb_matchext_struct { | ||
/** | ||
* The result of this match. | ||
*/ | ||
struct rmatch rmatch; | ||
} rb_matchext_t; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Instead of wrapping the struct rmatch
inside of struct rb_matchext_struct
, can we just allocate the rmatch
after the RMatch
? Saves one level of wrapping.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
That's easy to do, but I prefer not to remove the struct rb_matchext_struct
for two reasons.
- The name makes the purpose and the allocation clear to the reader. Developers who are already familiar with
rb_classext_t
should immediately recognise whatrb_matchext_t
is for, and where it is allocated. - It makes it easy to extend. In the future, if we need more fields to be allocated together in the GC slot, we can just add more fields to the
struct rb_matchext_struct
. For example, we can movestruct rmatch::char_offset
intostruct rb_matchext_struct
as a fieldstruct rmatch_offset char_offsets[];
. (Of course we don't have to do it right now.) If we remove this level of struct, we may need to add more macros to access additional fields. For example,#define RMATCH_CHAR_OFFSETS(m) (struct rmatch_offset*)((char*)m + sizeof(struct RMatch) + sizeof(struct rmatch))
. That's not as clear as simplyRMATCH_EXT(m)->char_offsets
.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Can we then rename struct rmatch
(which, IMO, is a terrible name because it's confusing with struct RMatch
) to struct rb_matchext_struct
? We can then just add fields to struct rb_matchext_struct
in the future if we want to add more fields.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Can we then rename
struct rmatch
(which, IMO, is a terrible name because it's confusing withstruct RMatch
) tostruct rb_matchext_struct
? We can then just add fields tostruct rb_matchext_struct
in the future if we want to add more fields.
Good idea. I also feel it confusing to have both struct rmatch
and struct RMatch
.
... because it is already part of the slot.
It is confusing to have both `struct RMatch` and `struct rmatch`. Now we rename `struct rmatch` to `rb_matchext_t`.
Here are the results of running yjit-bench on an Intel i7-6700k Skylake machine. The merge base
This PR:
The following table shows the ratio between the values of this PR and the merge base (<1.0 means speed-up and >1.0 means slow down) . The last row "(geomean)" is the geometric mean of other rows.
The difference is small for most benchmarks. Some benchmarks has noticeable improvement (ratio < 0.9) in interpreter time, including I'll try to run liquid again, and make some microbenchmarks that make heavy use of regular expressions. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I ran benchmarks of this branch against the base branch on my machine (AMD Ryzen 3600X).
This PR shouldn't affect the performance of YJIT so it doesn't make much sense to compare with vs. without YJIT. The following command compares between different Ruby builds:
./run_benchmarks.rb --headline -e "base::/home/peter/src/ruby-master/install/bin/ruby" -e "branch::/home/peter/src/ruby/install/bin/ruby" --rss
I get the following results:
-------------- --------- ---------- --------- ----------- ---------- --------- -------------- -----------
bench base (ms) stddev (%) RSS (MiB) branch (ms) stddev (%) RSS (MiB) branch 1st itr base/branch
activerecord 72.5 2.1 52.0 71.5 2.2 51.9 1.01 1.01
chunky-png 883.7 0.3 41.5 889.0 0.3 43.2 1.00 0.99
erubi-rails 20.8 13.4 91.3 20.5 14.4 90.5 1.01 1.01
hexapdf 2582.1 0.6 182.8 2578.5 0.9 197.2 1.02 1.00
liquid-c 66.0 0.1 33.7 65.4 0.3 34.5 0.99 1.01
liquid-compile 61.8 0.8 32.6 61.9 0.1 31.0 1.06 1.00
liquid-render 165.2 0.2 32.8 164.0 0.4 31.6 1.01 1.01
mail 136.5 0.1 46.5 134.1 0.1 46.9 1.01 1.02
psych-load 2166.1 0.1 33.3 2081.8 0.1 30.7 1.04 1.04
railsbench 2031.9 0.5 89.0 2037.6 0.5 89.3 1.00 1.00
ruby-lsp 66.5 2.9 90.0 65.5 3.0 92.6 1.00 1.01
sequel 73.4 0.9 36.6 73.5 0.9 36.6 1.00 1.00
-------------- --------- ---------- --------- ----------- ---------- --------- -------------- -----------
It looks like there's a small speedup in psych-load and mail. The other benchmarks look largely unchaged. I think it's good to ship this PR.
I used this microbenchmark: re = /(\d+),(\d+)/
ARGV[0].to_i.times do |i|
s = "#{i},#{i+1}"
m = re.match(s)
end
p GC::stat I ran it on the same Intel i7-6700k Skylake machine with the same builds. The difference is so obvious that I am not even going to calculate the average of several executions. Merge base: {:count=>3676, :time=>1276, :marking_time=>54, :sweeping_time=>1222, :heap_allocated_pages=>34, :heap_sorted_length=>208, :heap_allocatable_pages=>174, :heap_available_slots=>34583, :heap_live_slots=>31881, :heap_free_slots=>2702, :heap_final_slots=>0, :heap_marked_slots=>18324, :heap_eden_pages=>34, :heap_tomb_pages=>0, :total_allocated_pages=>34, :total_freed_pages=>0, :total_allocated_objects=>50069392, :total_freed_objects=>50037511, :malloc_increase_bytes=>50816, :malloc_increase_bytes_limit=>16777216, :minor_gc_count=>3673, :major_gc_count=>3, :compact_count=>0, :read_barrier_faults=>0, :total_moved_objects=>0, :remembered_wb_unprotected_objects=>0, :remembered_wb_unprotected_objects_limit=>183, :old_objects=>18300, :old_objects_limit=>36600, :oldmalloc_increase_bytes=>62496, :oldmalloc_increase_bytes_limit=>16777216} This PR: {:count=>2781, :time=>1032, :marking_time=>50, :sweeping_time=>982, :heap_allocated_pages=>40, :heap_sorted_length=>208, :heap_allocatable_pages=>168, :heap_available_slots=>39490, :heap_live_slots=>31392, :heap_free_slots=>8098, :heap_final_slots=>0, :heap_marked_slots=>18326, :heap_eden_pages=>40, :heap_tomb_pages=>0, :total_allocated_pages=>40, :total_freed_pages=>0, :total_allocated_objects=>50069391, :total_freed_objects=>50037999, :malloc_increase_bytes=>1056, :malloc_increase_bytes_limit=>16777216, :minor_gc_count=>2778, :major_gc_count=>3, :compact_count=>0, :read_barrier_faults=>0, :total_moved_objects=>0, :remembered_wb_unprotected_objects=>0, :remembered_wb_unprotected_objects_limit=>183, :old_objects=>18300, :old_objects_limit=>36600, :oldmalloc_increase_bytes=>1056, :oldmalloc_increase_bytes_limit=>16777216} There is an obvious drop in Liquid benchmark: ( Merge base:
This PR:
The improvement is measurable but not significant. I think it is because the proportion of time spent in allocation and GC is smaller than the micro benchmark, and underlying buffers of Strings and Arrays still dominates the time of sweeping in the Liquid benchmark, making the cost of freeing one of three underlying buffers in MatchData not that obvious. |
This commit makes use of the variable-width allocation feature, and allocates the
struct RMatch
and the underlyingstruct rmatch
together as one GC object, similar to howstruct RClass
andrb_classext_t
are allocated together. By making this change, we can reduce one level of indirection when accessing, and also reduce the amount of memory allocated withxmalloc
and the necessaryxfree
to be called.Some applications (such as Liquid) use regular expressions very frequently, and they generate a large amount of
MatchData
instances.