Embed struct rmatch into GC slot #8097

wks · 2023-07-19T09:33:50Z

This commit makes use of the variable-width allocation feature, and allocates the struct RMatch and the underlying struct rmatch together as one GC object, similar to how struct RClass and rb_classext_t are allocated together. This change removes one level of indirection when accessing the match data, and reduces both the amount of memory allocated with xmalloc and the number of xfree calls needed to release it.

Some applications (such as Liquid) use regular expressions very frequently, and they generate a large amount of MatchData instances.

peterzhu2118

The implementation looks good. Can you also run some of the yjit-bench headline benchmarks and write a few microbenchmarks to measure the performance and memory?

peterzhu2118 · 2023-07-19T12:44:58Z

include/ruby/internal/core/rmatch.h

+typedef struct rb_matchext_struct {
+    /**
+     * The result of this match.
+     */
+    struct rmatch rmatch;
+} rb_matchext_t;


Instead of wrapping the struct rmatch inside of struct rb_matchext_struct, can we just allocate the rmatch after the RMatch? Saves one level of wrapping.

That's easy to do, but I prefer not to remove the struct rb_matchext_struct for two reasons.

The name makes the purpose and the allocation clear to the reader. Developers who are already familiar with rb_classext_t should immediately recognise what rb_matchext_t is for, and where it is allocated.

It makes it easy to extend. In the future, if we need more fields to be allocated together in the GC slot, we can just add more fields to the struct rb_matchext_struct. For example, we can move struct rmatch::char_offset into struct rb_matchext_struct as a field struct rmatch_offset char_offsets[];. (Of course we don't have to do it right now.) If we remove this level of struct, we may need to add more macros to access additional fields. For example, #define RMATCH_CHAR_OFFSETS(m) (struct rmatch_offset*)((char*)m + sizeof(struct RMatch) + sizeof(struct rmatch)). That's not as clear as simply RMATCH_EXT(m)->char_offsets.

Can we then rename struct rmatch (which, IMO, is a terrible name because it's confusing with struct RMatch) to struct rb_matchext_struct? We can then just add fields to struct rb_matchext_struct in the future if we want to add more fields.

Good idea. I also find it confusing to have both struct rmatch and struct RMatch.

wks · 2023-07-20T14:22:32Z

Here are the results of running yjit-bench on an Intel i7-6700k Skylake machine.

The merge base:

interp: ruby 3.3.0dev (2023-07-19T03:42:20Z :detached: a3a74771f2) [x86_64-linux]
yjit: ruby 3.3.0dev (2023-07-19T03:42:20Z :detached: a3a74771f2) +YJIT [x86_64-linux]

--------------  -----------  ----------  ---------  ----------  ------------  -----------
bench           interp (ms)  stddev (%)  yjit (ms)  stddev (%)  yjit 1st itr  interp/yjit
activerecord    76.1         2.5         39.0       4.5         1.49          1.95       
chunky-png      782.1        0.2         472.0      0.4         1.54          1.66       
erubi-rails     22.2         11.8        13.6       20.7        0.29          1.62       
hexapdf         2660.7       1.2         1701.8     1.9         1.31          1.56       
liquid-c        66.2         0.3         46.2       1.0         0.74          1.43       
liquid-compile  62.0         3.6         44.6       2.9         0.66          1.39       
liquid-render   157.0        1.8         81.3       1.4         1.16          1.93       
mail            137.0        0.9         101.8      0.2         0.71          1.35       
psych-load      1954.6       0.1         1365.0     0.5         1.41          1.43       
railsbench      2233.1       0.4         1499.3     0.8         1.21          1.49       
ruby-lsp        66.6         2.9         45.6       25.3        0.57          1.46       
sequel          77.4         1.0         59.0       3.1         1.27          1.31       
binarytrees     370.7        1.3         171.1      1.5         2.09          2.17       
erubi           224.0        0.1         178.6      0.1         1.23          1.25       
etanni          307.4        0.0         307.7      0.1         1.00          1.00       
fannkuchredux   1668.8       0.6         550.4      0.1         1.00          3.03       
lee             981.3        0.9         673.1      1.0         1.42          1.46       
nbody           105.0        0.1         55.1       0.1         1.79          1.91       
optcarrot       4802.2       0.6         1716.1     0.5         2.61          2.80       
rack            105.7        1.1         86.4       1.5         1.12          1.22       
ruby-json       3001.5       0.0         2576.1     0.2         1.16          1.17       
rubykon         10105.9      0.4         5086.3     0.7         2.03          1.99       
30k_ifelse      2316.9       0.0         237.4      0.0         1.35          9.76       
30k_methods     5786.5       0.0         579.2      0.0         5.33          9.99       
cfunc_itself    89.9         0.3         27.3       0.2         3.19          3.30       
fib             197.7        0.1         33.0       0.1         5.90          5.99       
getivar         97.9         0.1         18.6       75.7        1.00          5.27       
keyword_args    219.7        1.7         34.6       0.3         6.09          6.35       
respond_to      223.7        0.0         18.3       0.5         11.79         12.24      
setivar         60.9         0.2         9.8        64.5        1.00          6.21       
setivar_object  94.1         0.8         38.0       38.2        1.00          2.47       
setivar_young   94.3         0.9         36.4       40.3        0.99          2.59       
str_concat      65.0         3.1         30.8       0.3         1.57          2.12       
throw           23.1         0.2         18.2       0.3         1.22          1.26       
--------------  -----------  ----------  ---------  ----------  ------------  -----------
Legend:
- yjit 1st itr: ratio of interp/yjit time for the first benchmarking iteration.
- interp/yjit: ratio of interp/yjit time. Higher is better for yjit. Above 1 represents a speedup.

This PR:

interp: ruby 3.3.0dev (2023-07-19T13:48:16Z :detached: f9654b2b28) [x86_64-linux]
yjit: ruby 3.3.0dev (2023-07-19T13:48:16Z :detached: f9654b2b28) +YJIT [x86_64-linux]

--------------  -----------  ----------  ---------  ----------  ------------  -----------
bench           interp (ms)  stddev (%)  yjit (ms)  stddev (%)  yjit 1st itr  interp/yjit
activerecord    76.7         2.3         39.0       4.5         1.48          1.97       
chunky-png      756.5        0.2         464.7      0.4         1.51          1.63       
erubi-rails     22.2         11.9        13.7       19.4        0.35          1.62       
hexapdf         2639.6       1.3         1693.3     0.9         1.33          1.56       
liquid-c        65.1         0.4         45.7       1.1         0.74          1.42       
liquid-compile  62.5         0.3         45.7       0.4         0.65          1.37       
liquid-render   157.4        0.4         82.8       0.2         1.16          1.90       
mail            134.6        0.1         99.6       0.1         0.71          1.35       
psych-load      1905.8       0.1         1335.0     0.2         1.41          1.43       
railsbench      2241.9       0.6         1503.6     0.9         1.23          1.49       
ruby-lsp        65.8         2.9         45.5       25.6        0.57          1.45       
sequel          76.4         1.0         57.4       1.6         1.29          1.33       
binarytrees     366.1        1.4         175.0      2.0         2.04          2.09       
erubi           225.4        0.0         179.9      0.1         1.22          1.25       
etanni          311.1        0.0         309.1      0.1         1.00          1.01       
fannkuchredux   1655.8       0.2         551.7      0.1         1.00          3.00       
lee             971.6        1.0         679.0      1.2         1.39          1.43       
nbody           104.7        0.2         51.5       0.0         1.89          2.03       
optcarrot       4789.8       0.6         1708.4     0.6         2.62          2.80       
rack            103.5        1.2         82.2       1.6         1.17          1.26       
ruby-json       2986.7       0.7         2597.1     0.1         1.15          1.15       
rubykon         10048.7      1.1         5082.7     0.4         2.03          1.98       
30k_ifelse      2327.3       0.1         237.3      0.1         1.34          9.81       
30k_methods     5785.9       0.0         579.2      0.0         5.30          9.99       
cfunc_itself    80.5         0.4         27.7       0.5         2.83          2.91       
fib             195.1        0.1         33.0       0.1         5.83          5.90       
getivar         86.4         0.4         18.1       63.9        1.03          4.77       
keyword_args    220.3        0.0         34.3       0.3         6.19          6.41       
respond_to      212.7        0.2         19.9       1.2         10.29         10.71      
setivar         50.9         0.6         9.6        52.2        1.00          5.28       
setivar_object  81.5         0.7         33.8       33.4        1.02          2.41       
setivar_young   81.2         0.6         33.9       33.4        1.00          2.39       
str_concat      62.0         0.3         32.4       0.3         1.47          1.91       
throw           23.1         0.2         18.1       0.3         1.23          1.28       
--------------  -----------  ----------  ---------  ----------  ------------  -----------
Legend:
- yjit 1st itr: ratio of interp/yjit time for the first benchmarking iteration.
- interp/yjit: ratio of interp/yjit time. Higher is better for yjit. Above 1 represents a speedup.

The following table shows the ratio between the values of this PR and the merge base (<1.0 means a speed-up and >1.0 means a slow-down). The last row "(geomean)" is the geometric mean of the other rows.

bench	interp:base	interp:pr	interp:ratio	yjit:base	yjit:pr	yjit:ratio
30k_ifelse	2316.9	2327.3	1.004	237.4	237.3	1
30k_methods	5786.5	5785.9	1	579.2	579.2	1
activerecord	76.1	76.7	1.007	39	39	0.999
binarytrees	370.7	366.1	0.988	171.1	175	1.023
cfunc_itself	89.9	80.5	0.895	27.3	27.7	1.015
chunky-png	782.1	756.5	0.967	472	464.7	0.985
erubi	224	225.4	1.006	178.6	179.9	1.007
erubi-rails	22.2	22.2	1.002	13.6	13.7	1.003
etanni	307.4	311.1	1.012	307.7	309.1	1.005
fannkuchredux	1668.8	1655.8	0.992	550.4	551.7	1.002
fib	197.7	195.1	0.987	33	33	1.001
getivar	97.9	86.4	0.883	18.6	18.1	0.976
hexapdf	2660.7	2639.6	0.992	1701.8	1693.3	0.995
keyword_args	219.7	220.3	1.003	34.6	34.3	0.992
lee	981.3	971.6	0.99	673.1	679	1.009
liquid-c	66.2	65.1	0.984	46.2	45.7	0.99
liquid-compile	62	62.5	1.008	44.6	45.7	1.025
liquid-render	157	157.4	1.003	81.3	82.8	1.018
mail	137	134.6	0.982	101.8	99.6	0.979
nbody	105	104.7	0.997	55.1	51.5	0.936
optcarrot	4802.2	4789.8	0.997	1716.1	1708.4	0.995
psych-load	1954.6	1905.8	0.975	1365	1335	0.978
rack	105.7	103.5	0.979	86.4	82.2	0.951
railsbench	2233.1	2241.9	1.004	1499.3	1503.6	1.003
respond_to	223.7	212.7	0.951	18.3	19.9	1.087
ruby-json	3001.5	2986.7	0.995	2576.1	2597.1	1.008
ruby-lsp	66.6	65.8	0.988	45.6	45.5	0.997
rubykon	10105.9	10048.7	0.994	5086.3	5082.7	0.999
sequel	77.4	76.4	0.986	59	57.4	0.974
setivar	60.9	50.9	0.835	9.8	9.6	0.982
setivar_object	94.1	81.5	0.866	38	33.8	0.889
setivar_young	94.3	81.2	0.861	36.4	33.9	0.931
str_concat	65	62	0.953	30.8	32.4	1.055
throw	23.1	23.1	1.003	18.2	18.1	0.991
(geomean)	288.4	280.4	0.972	124.4	123.6	0.993
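
The ratio and "(geomean)" figures can be reproduced with a short script. The following is a minimal sketch, not the script that produced the table above: the helper name `geomean` is an assumption, and only a handful of interpreter rows are copied in for illustration. Because the inputs are the rounded values shown in the table, the last digit of a ratio may differ from the table, and the geomean of this subset will not match the full-table value.

```ruby
# Hypothetical helper (not part of yjit-bench): geometric mean of positive
# numbers, computed via logarithms to avoid overflow on long lists.
def geomean(values)
  raise ArgumentError, "values must be non-empty" if values.empty?
  Math.exp(values.sum { |v| Math.log(v) } / values.size)
end

# A subset of the interpreter timings (ms) from the table: [base, pr].
interp_ms = {
  "cfunc_itself"  => [89.9, 80.5],
  "getivar"       => [97.9, 86.4],
  "liquid-render" => [157.0, 157.4],
  "psych-load"    => [1954.6, 1905.8],
  "setivar"       => [60.9, 50.9],
}

# A ratio below 1.0 means the PR build was faster than the merge base.
ratios = interp_ms.transform_values { |base, pr| pr / base }

ratios.each { |bench, ratio| puts format("%-14s %.3f", bench, ratio) }
puts format("%-14s %.3f", "(geomean)", geomean(ratios.values))
```

The geometric mean is used rather than the arithmetic mean because the rows are ratios: it treats a 2x speed-up and a 2x slow-down as cancelling out.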

The difference is small for most benchmarks. Some benchmarks show a noticeable improvement (ratio < 0.9) in interpreter time, including cfunc_itself, getivar and setivar*. This is hard to explain because those benchmarks are not related to regular expressions. Maybe the parser used by the compiler uses regular expressions under the hood, so the improvement is visible elsewhere.

I'll try to run liquid again, and make some microbenchmarks that make heavy use of regular expressions.

peterzhu2118

I ran benchmarks of this branch against the base branch on my machine (AMD Ryzen 3600X).

This PR shouldn't affect the performance of YJIT, so it doesn't make much sense to compare with vs. without YJIT. The following command compares the two Ruby builds instead:

./run_benchmarks.rb --headline -e "base::/home/peter/src/ruby-master/install/bin/ruby" -e "branch::/home/peter/src/ruby/install/bin/ruby" --rss

I get the following results:

--------------  ---------  ----------  ---------  -----------  ----------  ---------  --------------  -----------
bench           base (ms)  stddev (%)  RSS (MiB)  branch (ms)  stddev (%)  RSS (MiB)  branch 1st itr  base/branch
activerecord    72.5       2.1         52.0       71.5         2.2         51.9       1.01            1.01
chunky-png      883.7      0.3         41.5       889.0        0.3         43.2       1.00            0.99
erubi-rails     20.8       13.4        91.3       20.5         14.4        90.5       1.01            1.01
hexapdf         2582.1     0.6         182.8      2578.5       0.9         197.2      1.02            1.00
liquid-c        66.0       0.1         33.7       65.4         0.3         34.5       0.99            1.01
liquid-compile  61.8       0.8         32.6       61.9         0.1         31.0       1.06            1.00
liquid-render   165.2      0.2         32.8       164.0        0.4         31.6       1.01            1.01
mail            136.5      0.1         46.5       134.1        0.1         46.9       1.01            1.02
psych-load      2166.1     0.1         33.3       2081.8       0.1         30.7       1.04            1.04
railsbench      2031.9     0.5         89.0       2037.6       0.5         89.3       1.00            1.00
ruby-lsp        66.5       2.9         90.0       65.5         3.0         92.6       1.00            1.01
sequel          73.4       0.9         36.6       73.5         0.9         36.6       1.00            1.00
--------------  ---------  ----------  ---------  -----------  ----------  ---------  --------------  -----------

It looks like there's a small speedup in psych-load and mail. The other benchmarks look largely unchanged. I think it's good to ship this PR.

wks · 2023-07-21T04:00:23Z

I used this microbenchmark:

re = /(\d+),(\d+)/

ARGV[0].to_i.times do |i|
  s = "#{i},#{i+1}"
  m = re.match(s)
end

p GC::stat

I ran it on the same Intel i7-6700k Skylake machine with the same builds. The difference is so obvious that I am not even going to calculate the average of several executions.
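
For anyone who wants to repeat the measurement without reading a full `GC.stat` dump, a variant along the following lines may be handy. This is only a sketch under assumptions: the helper `measure_gc` is hypothetical, the iteration count is hard-coded instead of read from `ARGV`, and only `GC.stat` keys that exist across recent Ruby versions are used.

```ruby
# Hypothetical helper: run the block and report wall-clock time plus the
# change in selected GC.stat counters while the block was running.
def measure_gc(keys = %i[count total_allocated_objects])
  GC.start # start from a freshly collected heap
  before = GC.stat
  t0 = Process.clock_gettime(Process::CLOCK_MONOTONIC)
  yield
  elapsed = Process.clock_gettime(Process::CLOCK_MONOTONIC) - t0
  after = GC.stat
  # fetch with a default so that a key missing on this Ruby reads as 0
  deltas = keys.to_h { |k| [k, after.fetch(k, 0) - before.fetch(k, 0)] }
  { seconds: elapsed, **deltas }
end

re = /(\d+),(\d+)/
iterations = 100_000 # assumption: far smaller than the run reported here

result = measure_gc do
  iterations.times do |i|
    # Each call returns a fresh MatchData, which is the object this PR embeds.
    re.match("#{i},#{i + 1}")
  end
end

p result
```

Running the same script with both builds and comparing the `:count` deltas gives a quick check of whether the branch triggers fewer collections.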

Merge base:

{:count=>3676, :time=>1276, :marking_time=>54, :sweeping_time=>1222, :heap_allocated_pages=>34, :heap_sorted_length=>208, :heap_allocatable_pages=>174, :heap_available_slots=>34583, :heap_live_slots=>31881, :heap_free_slots=>2702, :heap_final_slots=>0, :heap_marked_slots=>18324, :heap_eden_pages=>34, :heap_tomb_pages=>0, :total_allocated_pages=>34, :total_freed_pages=>0, :total_allocated_objects=>50069392, :total_freed_objects=>50037511, :malloc_increase_bytes=>50816, :malloc_increase_bytes_limit=>16777216, :minor_gc_count=>3673, :major_gc_count=>3, :compact_count=>0, :read_barrier_faults=>0, :total_moved_objects=>0, :remembered_wb_unprotected_objects=>0, :remembered_wb_unprotected_objects_limit=>183, :old_objects=>18300, :old_objects_limit=>36600, :oldmalloc_increase_bytes=>62496, :oldmalloc_increase_bytes_limit=>16777216}

This PR:

{:count=>2781, :time=>1032, :marking_time=>50, :sweeping_time=>982, :heap_allocated_pages=>40, :heap_sorted_length=>208, :heap_allocatable_pages=>168, :heap_available_slots=>39490, :heap_live_slots=>31392, :heap_free_slots=>8098, :heap_final_slots=>0, :heap_marked_slots=>18326, :heap_eden_pages=>40, :heap_tomb_pages=>0, :total_allocated_pages=>40, :total_freed_pages=>0, :total_allocated_objects=>50069391, :total_freed_objects=>50037999, :malloc_increase_bytes=>1056, :malloc_increase_bytes_limit=>16777216, :minor_gc_count=>2778, :major_gc_count=>3, :compact_count=>0, :read_barrier_faults=>0, :total_moved_objects=>0, :remembered_wb_unprotected_objects=>0, :remembered_wb_unprotected_objects_limit=>183, :old_objects=>18300, :old_objects_limit=>36600, :oldmalloc_increase_bytes=>1056, :oldmalloc_increase_bytes_limit=>16777216}

There is an obvious drop in :sweeping_time (from 1222 to 982), probably because obj_free invokes the free function fewer times now that fewer objects are allocated with malloc. This is also visible in :malloc_increase_bytes, which dropped from 50816 to 1056.
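
To put numbers on the comparison, the sketch below computes the relative change for a few of the counters quoted above. The values are copied from the two `GC.stat` dumps; the helper name `percent_change` and the choice of keys are illustrative assumptions.

```ruby
# Values copied from the two GC.stat dumps above.
base = { count: 3676, time: 1276, sweeping_time: 1222, malloc_increase_bytes: 50816 }
pr   = { count: 2781, time: 1032, sweeping_time: 982,  malloc_increase_bytes: 1056 }

# Hypothetical helper: relative change from `before` to `after`, in percent.
# Negative values mean the PR build has the lower number.
def percent_change(before, after)
  (after - before) * 100.0 / before
end

changes = base.to_h { |key, before| [key, percent_change(before, pr.fetch(key))] }
changes.each { |key, pct| puts format("%-22s %+.1f%%", key, pct) }
```

On these figures the GC count drops by roughly a quarter and the sweeping time by roughly a fifth, both from a single run of each build.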

Liquid benchmark: (performance/benchmark.rb)

Merge base:

              parse:     36.423  (± 0.0%) i/s -    366.000  in  10.048831s
             render:    129.945  (± 1.5%) i/s -      1.313k in  10.106120s
     parse & render:     26.893  (± 0.0%) i/s -    270.000  in  10.041370s

This PR:

              parse:     37.327  (± 0.0%) i/s -    375.000  in  10.046230s
             render:    130.546  (± 0.0%) i/s -      1.313k in  10.057756s
     parse & render:     27.647  (± 0.0%) i/s -    278.000  in  10.055679s

The improvement is measurable but not significant. I think this is because the proportion of time spent in allocation and GC is smaller than in the microbenchmark, and the underlying buffers of Strings and Arrays still dominate the sweeping time in the Liquid benchmark, so the saving from no longer freeing one of the three underlying buffers of each MatchData is less visible.

Embed struct rmatch into GC slot

46858d5

peterzhu2118 reviewed Jul 19, 2023

wks added 2 commits July 19, 2023 21:48

Do not count rmatch in obj_memsize_of

f9654b2

... because it is already part of the slot.

Rename struct rmatch to rb_matchext_t

79ad6c8

It is confusing to have both `struct RMatch` and `struct rmatch`. Now we rename `struct rmatch` to `rb_matchext_t`.

peterzhu2118 approved these changes Jul 20, 2023

peterzhu2118 merged commit 639aa76 into ruby:master Jul 20, 2023
89 checks passed
