Embed struct rmatch into GC slot #8097

Merged
merged 3 commits into from Jul 20, 2023

wks
Contributor

@wks wks commented Jul 19, 2023

This commit uses the variable-width allocation feature to allocate the struct RMatch and the underlying struct rmatch together as one GC object, similar to how struct RClass and rb_classext_t are allocated together. This removes one level of indirection when accessing the match data, and reduces both the amount of memory allocated with xmalloc and the number of xfree calls needed.

Some applications (such as Liquid) use regular expressions very frequently, and they generate a large number of MatchData instances.
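As a rough illustration of why MatchData allocation matters, the sketch below shows that every successful match produces its own MatchData object. The helper name `collect_matches` is made up for this example, and `ObjectSpace.memsize_of` is only a hint whose value depends on the Ruby build, so no particular number should be expected from it.

```ruby
require "objspace"

# Hypothetical helper (not part of the PR): run the same kind of match as a
# regex-heavy application would and keep the resulting MatchData objects.
def collect_matches(count)
  re = /(\d+),(\d+)/
  Array.new(count) { |i| re.match("#{i},#{i + 1}") }
end

matches = collect_matches(1_000)

# Every successful match yields a separate MatchData object for the GC to manage.
puts "distinct MatchData objects: #{matches.map(&:object_id).uniq.size}"

# memsize_of is an implementation-dependent hint; the value varies between builds.
puts "memsize_of one MatchData: #{ObjectSpace.memsize_of(matches.first)} bytes"
```

Running this on two builds gives a quick, informal view of the per-object footprint; it does not replace the GC.stat and benchmark measurements discussed in this conversation.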

Member

@peterzhu2118 peterzhu2118 left a comment

The implementation looks good. Can you also run some of the yjit-bench headline benchmarks and write a few microbenchmarks to measure the performance and memory?

Comment on lines 110 to 115
typedef struct rb_matchext_struct {
    /**
     * The result of this match.
     */
    struct rmatch rmatch;
} rb_matchext_t;
Member

Instead of wrapping the struct rmatch inside of struct rb_matchext_struct, can we just allocate the rmatch after the RMatch? Saves one level of wrapping.

Contributor Author

That's easy to do, but I prefer not to remove the struct rb_matchext_struct for two reasons.

  1. The name makes the purpose and the allocation clear to the reader. Developers who are already familiar with rb_classext_t should immediately recognise what rb_matchext_t is for, and where it is allocated.
  2. It makes it easy to extend. In the future, if we need more fields to be allocated together in the GC slot, we can just add more fields to the struct rb_matchext_struct. For example, we can move struct rmatch::char_offset into struct rb_matchext_struct as a field struct rmatch_offset char_offsets[];. (Of course we don't have to do it right now.) If we remove this level of struct, we may need to add more macros to access additional fields. For example, #define RMATCH_CHAR_OFFSETS(m) (struct rmatch_offset*)((char*)m + sizeof(struct RMatch) + sizeof(struct rmatch)). That's not as clear as simply RMATCH_EXT(m)->char_offsets.

Member

Can we then rename struct rmatch (which, IMO, is a terrible name because it's confusing with struct RMatch) to struct rb_matchext_struct? We can then just add fields to struct rb_matchext_struct in the future if we want to add more fields.

Contributor Author

> Can we then rename struct rmatch (which, IMO, is a terrible name because it's confusing with struct RMatch) to struct rb_matchext_struct? We can then just add fields to struct rb_matchext_struct in the future if we want to add more fields.

Good idea. I also find it confusing to have both struct rmatch and struct RMatch.

wks added 2 commits July 19, 2023 21:48
... because it is already part of the slot.
It is confusing to have both `struct RMatch` and `struct rmatch`.  Now
we rename `struct rmatch` to `rb_matchext_t`.
@wks
Contributor Author

wks commented Jul 20, 2023

Here are the results of running yjit-bench on an Intel i7-6700k Skylake machine.

The merge base

interp: ruby 3.3.0dev (2023-07-19T03:42:20Z :detached: a3a74771f2) [x86_64-linux]
yjit: ruby 3.3.0dev (2023-07-19T03:42:20Z :detached: a3a74771f2) +YJIT [x86_64-linux]

--------------  -----------  ----------  ---------  ----------  ------------  -----------
bench           interp (ms)  stddev (%)  yjit (ms)  stddev (%)  yjit 1st itr  interp/yjit
activerecord    76.1         2.5         39.0       4.5         1.49          1.95       
chunky-png      782.1        0.2         472.0      0.4         1.54          1.66       
erubi-rails     22.2         11.8        13.6       20.7        0.29          1.62       
hexapdf         2660.7       1.2         1701.8     1.9         1.31          1.56       
liquid-c        66.2         0.3         46.2       1.0         0.74          1.43       
liquid-compile  62.0         3.6         44.6       2.9         0.66          1.39       
liquid-render   157.0        1.8         81.3       1.4         1.16          1.93       
mail            137.0        0.9         101.8      0.2         0.71          1.35       
psych-load      1954.6       0.1         1365.0     0.5         1.41          1.43       
railsbench      2233.1       0.4         1499.3     0.8         1.21          1.49       
ruby-lsp        66.6         2.9         45.6       25.3        0.57          1.46       
sequel          77.4         1.0         59.0       3.1         1.27          1.31       
binarytrees     370.7        1.3         171.1      1.5         2.09          2.17       
erubi           224.0        0.1         178.6      0.1         1.23          1.25       
etanni          307.4        0.0         307.7      0.1         1.00          1.00       
fannkuchredux   1668.8       0.6         550.4      0.1         1.00          3.03       
lee             981.3        0.9         673.1      1.0         1.42          1.46       
nbody           105.0        0.1         55.1       0.1         1.79          1.91       
optcarrot       4802.2       0.6         1716.1     0.5         2.61          2.80       
rack            105.7        1.1         86.4       1.5         1.12          1.22       
ruby-json       3001.5       0.0         2576.1     0.2         1.16          1.17       
rubykon         10105.9      0.4         5086.3     0.7         2.03          1.99       
30k_ifelse      2316.9       0.0         237.4      0.0         1.35          9.76       
30k_methods     5786.5       0.0         579.2      0.0         5.33          9.99       
cfunc_itself    89.9         0.3         27.3       0.2         3.19          3.30       
fib             197.7        0.1         33.0       0.1         5.90          5.99       
getivar         97.9         0.1         18.6       75.7        1.00          5.27       
keyword_args    219.7        1.7         34.6       0.3         6.09          6.35       
respond_to      223.7        0.0         18.3       0.5         11.79         12.24      
setivar         60.9         0.2         9.8        64.5        1.00          6.21       
setivar_object  94.1         0.8         38.0       38.2        1.00          2.47       
setivar_young   94.3         0.9         36.4       40.3        0.99          2.59       
str_concat      65.0         3.1         30.8       0.3         1.57          2.12       
throw           23.1         0.2         18.2       0.3         1.22          1.26       
--------------  -----------  ----------  ---------  ----------  ------------  -----------
Legend:
- yjit 1st itr: ratio of interp/yjit time for the first benchmarking iteration.
- interp/yjit: ratio of interp/yjit time. Higher is better for yjit. Above 1 represents a speedup.

This PR:

interp: ruby 3.3.0dev (2023-07-19T13:48:16Z :detached: f9654b2b28) [x86_64-linux]
yjit: ruby 3.3.0dev (2023-07-19T13:48:16Z :detached: f9654b2b28) +YJIT [x86_64-linux]

--------------  -----------  ----------  ---------  ----------  ------------  -----------
bench           interp (ms)  stddev (%)  yjit (ms)  stddev (%)  yjit 1st itr  interp/yjit
activerecord    76.7         2.3         39.0       4.5         1.48          1.97       
chunky-png      756.5        0.2         464.7      0.4         1.51          1.63       
erubi-rails     22.2         11.9        13.7       19.4        0.35          1.62       
hexapdf         2639.6       1.3         1693.3     0.9         1.33          1.56       
liquid-c        65.1         0.4         45.7       1.1         0.74          1.42       
liquid-compile  62.5         0.3         45.7       0.4         0.65          1.37       
liquid-render   157.4        0.4         82.8       0.2         1.16          1.90       
mail            134.6        0.1         99.6       0.1         0.71          1.35       
psych-load      1905.8       0.1         1335.0     0.2         1.41          1.43       
railsbench      2241.9       0.6         1503.6     0.9         1.23          1.49       
ruby-lsp        65.8         2.9         45.5       25.6        0.57          1.45       
sequel          76.4         1.0         57.4       1.6         1.29          1.33       
binarytrees     366.1        1.4         175.0      2.0         2.04          2.09       
erubi           225.4        0.0         179.9      0.1         1.22          1.25       
etanni          311.1        0.0         309.1      0.1         1.00          1.01       
fannkuchredux   1655.8       0.2         551.7      0.1         1.00          3.00       
lee             971.6        1.0         679.0      1.2         1.39          1.43       
nbody           104.7        0.2         51.5       0.0         1.89          2.03       
optcarrot       4789.8       0.6         1708.4     0.6         2.62          2.80       
rack            103.5        1.2         82.2       1.6         1.17          1.26       
ruby-json       2986.7       0.7         2597.1     0.1         1.15          1.15       
rubykon         10048.7      1.1         5082.7     0.4         2.03          1.98       
30k_ifelse      2327.3       0.1         237.3      0.1         1.34          9.81       
30k_methods     5785.9       0.0         579.2      0.0         5.30          9.99       
cfunc_itself    80.5         0.4         27.7       0.5         2.83          2.91       
fib             195.1        0.1         33.0       0.1         5.83          5.90       
getivar         86.4         0.4         18.1       63.9        1.03          4.77       
keyword_args    220.3        0.0         34.3       0.3         6.19          6.41       
respond_to      212.7        0.2         19.9       1.2         10.29         10.71      
setivar         50.9         0.6         9.6        52.2        1.00          5.28       
setivar_object  81.5         0.7         33.8       33.4        1.02          2.41       
setivar_young   81.2         0.6         33.9       33.4        1.00          2.39       
str_concat      62.0         0.3         32.4       0.3         1.47          1.91       
throw           23.1         0.2         18.1       0.3         1.23          1.28       
--------------  -----------  ----------  ---------  ----------  ------------  -----------
Legend:
- yjit 1st itr: ratio of interp/yjit time for the first benchmarking iteration.
- interp/yjit: ratio of interp/yjit time. Higher is better for yjit. Above 1 represents a speedup.

The following table shows the ratio between the values of this PR and the merge base (<1.0 means a speed-up and >1.0 means a slowdown). The last row, "(geomean)", is the geometric mean of the other rows.

--------------  -----------  ---------  ------------  ---------  -------  ----------
bench           interp:base  interp:pr  interp:ratio  yjit:base  yjit:pr  yjit:ratio
30k_ifelse      2316.9       2327.3     1.004         237.4      237.3    1
30k_methods     5786.5       5785.9     1             579.2      579.2    1
activerecord    76.1         76.7       1.007         39         39       0.999
binarytrees     370.7        366.1      0.988         171.1      175      1.023
cfunc_itself    89.9         80.5       0.895         27.3       27.7     1.015
chunky-png      782.1        756.5      0.967         472        464.7    0.985
erubi           224          225.4      1.006         178.6      179.9    1.007
erubi-rails     22.2         22.2       1.002         13.6       13.7     1.003
etanni          307.4        311.1      1.012         307.7      309.1    1.005
fannkuchredux   1668.8       1655.8     0.992         550.4      551.7    1.002
fib             197.7        195.1      0.987         33         33       1.001
getivar         97.9         86.4       0.883         18.6       18.1     0.976
hexapdf         2660.7       2639.6     0.992         1701.8     1693.3   0.995
keyword_args    219.7        220.3      1.003         34.6       34.3     0.992
lee             981.3        971.6      0.99          673.1      679      1.009
liquid-c        66.2         65.1       0.984         46.2       45.7     0.99
liquid-compile  62           62.5       1.008         44.6       45.7     1.025
liquid-render   157          157.4      1.003         81.3       82.8     1.018
mail            137          134.6      0.982         101.8      99.6     0.979
nbody           105          104.7      0.997         55.1       51.5     0.936
optcarrot       4802.2       4789.8     0.997         1716.1     1708.4   0.995
psych-load      1954.6       1905.8     0.975         1365       1335     0.978
rack            105.7        103.5      0.979         86.4       82.2     0.951
railsbench      2233.1       2241.9     1.004         1499.3     1503.6   1.003
respond_to      223.7        212.7      0.951         18.3       19.9     1.087
ruby-json       3001.5       2986.7     0.995         2576.1     2597.1   1.008
ruby-lsp        66.6         65.8       0.988         45.6       45.5     0.997
rubykon         10105.9      10048.7    0.994         5086.3     5082.7   0.999
sequel          77.4         76.4       0.986         59         57.4     0.974
setivar         60.9         50.9       0.835         9.8        9.6      0.982
setivar_object  94.1         81.5       0.866         38         33.8     0.889
setivar_young   94.3         81.2       0.861         36.4       33.9     0.931
str_concat      65           62         0.953         30.8       32.4     1.055
throw           23.1         23.1       1.003         18.2       18.1     0.991
(geomean)       288.4        280.4      0.972         124.4      123.6    0.993
--------------  -----------  ---------  ------------  ---------  -------  ----------
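
For readers who want to reproduce the ratio columns, here is a minimal sketch of the arithmetic, assuming each ratio is the PR time divided by the merge-base time and the summary row is a geometric mean. The `geomean` helper and the three-row subset are choices made for this example. Because it starts from the rounded values printed in the tables, individual results can differ from the table in the last digit, and the geometric mean of a subset will not match the "(geomean)" row.

```ruby
# Geometric mean computed through logarithms, which avoids overflow when
# multiplying many values together.
def geomean(values)
  Math.exp(values.sum { |v| Math.log(v) } / values.size)
end

# A few interpreter timings in ms, copied from the tables above as [base, pr].
timings = {
  "liquid-render" => [157.0, 157.4],
  "psych-load"    => [1954.6, 1905.8],
  "setivar"       => [60.9, 50.9],
}

# Ratio below 1.0 means the PR build was faster than the merge base.
ratios = timings.transform_values { |base, pr| pr / base }

ratios.each { |name, ratio| puts format("%-14s %.3f", name, ratio) }
puts format("%-14s %.3f", "(geomean)", geomean(ratios.values))
```

A geometric mean is the usual choice for summarising ratios because it treats a 2x speed-up and a 2x slowdown symmetrically.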

The difference is small for most benchmarks. Some benchmarks show a noticeable improvement (ratio < 0.9) in interpreter time, including cfunc_itself, getivar and setivar*. This is hard to explain because those benchmarks are not related to regular expressions. Maybe the parser used by the compiler uses regular expressions under the hood, so the improvement shows up elsewhere.

I'll try to run liquid again, and make some microbenchmarks that make heavy use of regular expressions.

Member

@peterzhu2118 peterzhu2118 left a comment

I ran benchmarks of this branch against the base branch on my machine (AMD Ryzen 3600X).

This PR shouldn't affect the performance of YJIT, so it doesn't make much sense to compare with vs. without YJIT. The following command compares different Ruby builds:

./run_benchmarks.rb --headline -e "base::/home/peter/src/ruby-master/install/bin/ruby" -e "branch::/home/peter/src/ruby/install/bin/ruby" --rss

I get the following results:

--------------  ---------  ----------  ---------  -----------  ----------  ---------  --------------  -----------
bench           base (ms)  stddev (%)  RSS (MiB)  branch (ms)  stddev (%)  RSS (MiB)  branch 1st itr  base/branch
activerecord    72.5       2.1         52.0       71.5         2.2         51.9       1.01            1.01
chunky-png      883.7      0.3         41.5       889.0        0.3         43.2       1.00            0.99
erubi-rails     20.8       13.4        91.3       20.5         14.4        90.5       1.01            1.01
hexapdf         2582.1     0.6         182.8      2578.5       0.9         197.2      1.02            1.00
liquid-c        66.0       0.1         33.7       65.4         0.3         34.5       0.99            1.01
liquid-compile  61.8       0.8         32.6       61.9         0.1         31.0       1.06            1.00
liquid-render   165.2      0.2         32.8       164.0        0.4         31.6       1.01            1.01
mail            136.5      0.1         46.5       134.1        0.1         46.9       1.01            1.02
psych-load      2166.1     0.1         33.3       2081.8       0.1         30.7       1.04            1.04
railsbench      2031.9     0.5         89.0       2037.6       0.5         89.3       1.00            1.00
ruby-lsp        66.5       2.9         90.0       65.5         3.0         92.6       1.00            1.01
sequel          73.4       0.9         36.6       73.5         0.9         36.6       1.00            1.00
--------------  ---------  ----------  ---------  -----------  ----------  ---------  --------------  -----------

It looks like there's a small speedup in psych-load and mail. The other benchmarks look largely unchanged. I think it's good to ship this PR.

@peterzhu2118 peterzhu2118 merged commit 639aa76 into ruby:master Jul 20, 2023
89 checks passed
@wks
Contributor Author

wks commented Jul 21, 2023

I used this microbenchmark:

re = /(\d+),(\d+)/

ARGV[0].to_i.times do |i|
  s = "#{i},#{i+1}"
  m = re.match(s)
end

p GC::stat

I ran it on the same Intel i7-6700k Skylake machine with the same builds. The difference is so obvious that I am not even going to calculate the average of several executions.
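
The microbenchmark prints cumulative, process-wide counters. A possible variation, sketched here with a hypothetical helper named `gc_stat_delta`, reports only how selected counters change while the loop runs. Some counters, such as `:malloc_increase_bytes`, are not monotonic, so their delta can be negative; the iteration count below is arbitrary.

```ruby
# Hypothetical helper (not from the PR): report how selected GC.stat counters
# change while the given block runs.
def gc_stat_delta(keys)
  before = GC.stat
  yield
  after = GC.stat
  keys.to_h { |key| [key, after[key] - before[key]] }
end

re = /(\d+),(\d+)/

delta = gc_stat_delta(%i[count total_allocated_objects malloc_increase_bytes]) do
  100_000.times { |i| re.match("#{i},#{i + 1}") }
end

p delta
```

Comparing the printed hash between two builds focuses on the loop itself rather than on interpreter start-up.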

Merge base:

{:count=>3676, :time=>1276, :marking_time=>54, :sweeping_time=>1222, :heap_allocated_pages=>34, :heap_sorted_length=>208, :heap_allocatable_pages=>174, :heap_available_slots=>34583, :heap_live_slots=>31881, :heap_free_slots=>2702, :heap_final_slots=>0, :heap_marked_slots=>18324, :heap_eden_pages=>34, :heap_tomb_pages=>0, :total_allocated_pages=>34, :total_freed_pages=>0, :total_allocated_objects=>50069392, :total_freed_objects=>50037511, :malloc_increase_bytes=>50816, :malloc_increase_bytes_limit=>16777216, :minor_gc_count=>3673, :major_gc_count=>3, :compact_count=>0, :read_barrier_faults=>0, :total_moved_objects=>0, :remembered_wb_unprotected_objects=>0, :remembered_wb_unprotected_objects_limit=>183, :old_objects=>18300, :old_objects_limit=>36600, :oldmalloc_increase_bytes=>62496, :oldmalloc_increase_bytes_limit=>16777216}

This PR:

{:count=>2781, :time=>1032, :marking_time=>50, :sweeping_time=>982, :heap_allocated_pages=>40, :heap_sorted_length=>208, :heap_allocatable_pages=>168, :heap_available_slots=>39490, :heap_live_slots=>31392, :heap_free_slots=>8098, :heap_final_slots=>0, :heap_marked_slots=>18326, :heap_eden_pages=>40, :heap_tomb_pages=>0, :total_allocated_pages=>40, :total_freed_pages=>0, :total_allocated_objects=>50069391, :total_freed_objects=>50037999, :malloc_increase_bytes=>1056, :malloc_increase_bytes_limit=>16777216, :minor_gc_count=>2778, :major_gc_count=>3, :compact_count=>0, :read_barrier_faults=>0, :total_moved_objects=>0, :remembered_wb_unprotected_objects=>0, :remembered_wb_unprotected_objects_limit=>183, :old_objects=>18300, :old_objects_limit=>36600, :oldmalloc_increase_bytes=>1056, :oldmalloc_increase_bytes_limit=>16777216}

There is an obvious drop in :sweeping_time (from 1222 to 982), probably because obj_free calls the free function fewer times now that fewer objects are allocated with malloc. This is also visible in :malloc_increase_bytes, which dropped from 50816 to 1056.
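
To put the two GC.stat dumps side by side, the following sketch computes the relative change for a handful of counters. The values are copied from the hashes above; the `percent_change` helper and the choice of keys are just for illustration.

```ruby
# Selected counters copied from the two GC.stat dumps above.
base = { count: 3676, time: 1276, sweeping_time: 1222, malloc_increase_bytes: 50_816 }
pr   = { count: 2781, time: 1032, sweeping_time: 982,  malloc_increase_bytes: 1_056 }

# Relative change in percent; negative means the value went down in the PR build.
def percent_change(before, after)
  (after - before) * 100.0 / before
end

base.each_key do |key|
  change = percent_change(base[key], pr[key])
  puts format("%-22s %8d -> %8d (%+.1f%%)", key, base[key], pr[key], change)
end
```

These figures come from a single run of each build, so they show the direction of the change rather than a precise effect size.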

Liquid benchmark: (performance/benchmark.rb)

Merge base:

              parse:     36.423  (± 0.0%) i/s -    366.000  in  10.048831s
             render:    129.945  (± 1.5%) i/s -      1.313k in  10.106120s
     parse & render:     26.893  (± 0.0%) i/s -    270.000  in  10.041370s

This PR:

              parse:     37.327  (± 0.0%) i/s -    375.000  in  10.046230s
             render:    130.546  (± 0.0%) i/s -      1.313k in  10.057756s
     parse & render:     27.647  (± 0.0%) i/s -    278.000  in  10.055679s

The improvement is measurable but small. I think this is because a smaller proportion of time is spent in allocation and GC than in the microbenchmark, and the underlying buffers of Strings and Arrays still dominate sweeping time in the Liquid benchmark, so the cost of freeing one of the three underlying buffers in MatchData is not that noticeable.
