send-pop optimisation #2100

Open
wants to merge 20 commits into
base: master
from

@shyouhei
Member

commented Mar 19, 2019

Abstract

"Calling a method, then immediately discarding its return value(s)" is one of the most frequent operations that the interpreter performs, so it is worth optimising. To do so, we implement the following two techniques:

  • Allow methods to omit their return values when those values, if any, are not used.
  • Combine the send-then-pop sequence into one.

With the above changes we observe improvements not only on microbenchmarks but also on a few non-micro ones.

Introduction

Ruby does not force a style of writing on you. In Ruby, one thing tends to be doable in more than one way, and this is considered a good thing. To make it possible, the return value of a method is not required to be used: every method can (and does) return possibly multiple values, while its callers are free to ignore them. Even when a method does not expect its callers to take any return value, it tends to return something meaningful "just in case" that expectation breaks.

However, these "just in case" return values rarely get used in practice. Most of the time they are silently ignored. They become instant garbage unless referenced elsewhere, which is of course a waste of both time and space. There is room for improvement in this area.

How often does this happen? We can measure it as follows. The interpreter implements a set of instructions and executes them in sequence. This sequence can be seen as a conceptual language, so we can consider its 2-grams (pairs of adjacent instructions). By collecting the 2-grams over the entire execution of a Ruby program, we can see what share of them corresponds to the operation we are talking about.

The following is the top 10 list of 2-grams over the entire execution of the mame/optcarrot benchmark:
zsh % LANG=C wc -l 2gram.txt
1143155369
zsh % LANG=C sort 2gram.txt | uniq -c | sort -nr | head -n 10
69065813 getinstancevariable -> getinstancevariable
65600442 putself -> getinstancevariable
59624140 getinstancevariable -> branchunless
59116388 branchunless -> getinstancevariable
52828407 leave -> pop
50434175 getinstancevariable -> putobject
30368815 pop -> putself
27717161 setinstancevariable -> getinstancevariable
25661090 branchunless -> putself
25165032 getinstancevariable -> branchif

Here, the leave instruction (almost) resembles Ruby's return statement, and the pop instruction (almost) resembles Ruby's ';' delimiter. So the "leave -> pop" line indicates that a method returns a value and that value is not used. This turns out to be the fifth most frequent operation in the entire execution of the program: 52,828,407 of 1,143,155,369 2-grams, or about 4.6% of the whole.
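To make the counting scheme concrete, here is a minimal sketch of 2-gram counting over an instruction trace, together with the ratio quoted above. `bigram_counts` and `TOY_TRACE` are hypothetical names used only for illustration; the real trace was collected from the interpreter, not built like this.

```ruby
# Count every pair of adjacent instructions (2-grams) in a trace.
def bigram_counts(trace)
  trace.each_cons(2).each_with_object(Hash.new(0)) do |pair, counts|
    counts[pair] += 1
  end
end

# Toy trace for illustration only; not taken from optcarrot.
TOY_TRACE = %i[
  putself send leave pop
  putself send leave pop
  putself leave
].freeze

# Share of "leave -> pop" among all 2-grams, using the optcarrot
# figures quoted in the listing (count / total lines of 2gram.txt).
LEAVE_POP_SHARE = (52_828_407.0 / 1_143_155_369 * 100).round(1)
```

With the toy trace, `bigram_counts(TOY_TRACE)[[:leave, :pop]]` is 2, and `LEAVE_POP_SHARE` evaluates to 4.6.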

Telling methods that their return values are not used

The first step to remedy the situation is the introduction of a new method calling convention that allows methods to return arbitrary values when the return value is not used. We do not force them to eliminate unused return values. This is because, at the beginning, every method in the wild -- especially those written in C -- already returns something. In order not to break existing code, methods must be allowed to return values even if they are discarded. For new ones, however, let us make room for optimisations.

This is done by setting a 1-bit flag in a method stack frame. Every time a method is called, several flags are set in the VM's stack. We add a flag called VM_FRAME_FLAG_POPPED which denotes that the return value is not used.
diff --git a/vm_core.h b/vm_core.h
index 574837dea0..513b8b85c1 100644
--- a/vm_core.h
+++ b/vm_core.h
@@ -1132,11 +1133,11 @@ typedef rb_control_frame_t *

 enum {
     /* Frame/Environment flag bits:
-     *   MMMM MMMM MMMM MMMM ____ __FF FFFF EEEX (LSB)
+     *   MMMM MMMM MMMM MMMM ____ _FFF FFFF EEEX (LSB)
      *
      * X   : tag for GC marking (It seems as Fixnum)
      * EEE : 3 bits Env flags
-     * FF..: 6 bits Frame flags
+     * FF..: 7 bits Frame flags
      * MM..: 15 bits frame magic (to check frame corruption)
      */

@@ -1160,6 +1161,7 @@ enum {
     VM_FRAME_FLAG_CFRAME    = 0x0080,
     VM_FRAME_FLAG_LAMBDA    = 0x0100,
     VM_FRAME_FLAG_MODIFIED_BLOCK_PARAM = 0x0200,
+    VM_FRAME_FLAG_POPPED    = 0x0400,

     /* env flag */
     VM_ENV_FLAG_LOCAL       = 0x0002,

Whether we should set this flag or not is determined by the program counter (PC hereafter). When the PC reaches a method invocation and the instruction immediately after it is pop, that is where we should set this flag.
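The rule can be sketched in Ruby over a toy instruction list. This is only an illustration of the decision, not the VM code: `frame_flags_for_call` is a hypothetical helper, and instructions are modelled as bare symbols rather than real operands.

```ruby
VM_FRAME_FLAG_POPPED = 0x0400 # value taken from the vm_core.h diff

# Hypothetical helper: the frame flags a call at index `pc` would get.
# The flag is added only when the very next instruction is `pop`.
def frame_flags_for_call(insns, pc, base_flags = 0)
  if insns[pc + 1] == :pop
    base_flags | VM_FRAME_FLAG_POPPED
  else
    base_flags
  end
end

# Toy sequence: the first send is followed by pop, the second is not.
TOY_INSNS = %i[putself send pop putself send leave].freeze
```

Here `frame_flags_for_call(TOY_INSNS, 1)` carries the POPPED bit, while `frame_flags_for_call(TOY_INSNS, 4)` does not.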

Automatic abortion of a method using the flag

Now every method is told whether its return value is used. However, that alone does not speed things up. We want to make methods faster automatically, without any modification of the program.

Consider the following program:

def foo(x)
  y = bar(x)
  return y
end
Without the proposed changeset, this method is compiled into:
== disasm: #<ISeq:foo@<compiled>:1 (1,2)-(4,5)> (catch: FALSE)
local table (size: 2, argc: 1 [opts: 0, rest: -1, post: 0, block: -1, kw: -1@-1, kwrest: -1])
[ 2] x@0<Arg>   [ 1] y@1
0000 putself                                                          (   2)[LiCa]
0001 getlocal                     x@0, 0
0004 send                         <callinfo!mid:bar, argc:1, FCALL|ARGS_SIMPLE>, <callcache>, nil
0008 setlocal                     y@1, 0
0011 getlocal                     y@1, 0                              (   3)[Li]
0014 leave                                                            (   4)[Re]

Now that we can tell whether the return value is used, consider the case where it is not. We can then safely say that the last three instructions (the "setlocal - getlocal - leave" sequence) have no meaning. On the other hand, the call to bar might or might not be meaningful, depending on how bar behaves.

Thus, we can insert a flag check between send and setlocal, like this:
== disasm: #<ISeq:foo@<compiled>:1 (1,2)-(4,5)> (catch: FALSE)
local table (size: 2, argc: 1 [opts: 0, rest: -1, post: 0, block: -1, kw: -1@-1, kwrest: -1])
[ 2] x@0<Arg>   [ 1] y@1
0000 putself                                                          (   2)[LiCa]
0001 getlocal                               x@0, 0
0004 send                                   <callinfo!mid:bar, argc:1, FCALL|ARGS_SIMPLE>, <callcache>, nil
0008 opt_bailout                            1
0010 setlocal                               y@1, 0
0013 getlocal                               y@1, 0                    (   3)[Li]
0016 leave                                                            (   4)[Re]
Note the newly introduced `opt_bailout` instruction at PC 8. Here is the implementation:
diff --git a/insns.def b/insns.def
index 2e7d39ec17..fcbd352c5c 100644
--- a/insns.def
+++ b/insns.def
@@ -946,6 +962,29 @@ leave
     }
 }
 
+/* This instruction is no-op unless the instruction sequence is called
+ * with VM_FRAME_FLAG_POPPED.  With that flag on, it immediately
+ * leaves the current stack frame with scratching the topmost n stack
+ * values.  The return value of the iseq for that case is always
+ * nil. */
+DEFINE_INSN
+opt_bailout
+(rb_num_t n)
+()
+()
+{
+#ifdef MJIT_HEADER
+    /* :FIXME: don't know how to make it work with JIT... */
+#else
+    if (VM_ENV_FLAGS(GET_EP(), VM_FRAME_FLAG_POPPED) &&
+        CURRENT_INSN_IS(opt_bailout) /* <- rule out trace instruction */ ) {
+        POPN(n);
+        PUSH(Qnil);
+        DISPATCH_ORIGINAL_INSN(leave);
+    }
+#endif
+}
+
 /**********************************************************/
 /* deal with control flow 3: exception                    */
 /**********************************************************/
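The effect of opt_bailout can be illustrated with a toy stack machine in Ruby. This is a sketch under simplifying assumptions, not the real VM: `FOO_ISEQ` and `run_foo` are hypothetical, the `popped:` keyword stands in for VM_FRAME_FLAG_POPPED, and the call to bar is replaced by a stand-in that doubles its argument.

```ruby
# Toy model of the instruction sequence for `foo` with opt_bailout.
FOO_ISEQ = [
  [:putself],
  [:getlocal, :x],
  [:send, :bar, 1],
  [:opt_bailout, 1],
  [:setlocal, :y],
  [:getlocal, :y],
  [:leave],
].freeze

# Runs FOO_ISEQ and returns [return_value, instructions_executed].
def run_foo(x, popped:)
  stack = []
  locals = { x: x }
  executed = 0
  FOO_ISEQ.each do |op, a, b|
    executed += 1
    case op
    when :putself  then stack.push(:self)
    when :getlocal then stack.push(locals[a])
    when :send
      args = stack.pop(b)         # pop the arguments
      stack.pop                   # pop the receiver
      stack.push(args.first * 2)  # stand-in for bar(x)
    when :opt_bailout
      if popped
        stack.pop(a)              # models POPN(n)
        return [nil, executed]    # models PUSH(Qnil) followed by leave
      end
    when :setlocal then locals[a] = stack.pop
    when :leave    then return [stack.pop, executed]
    end
  end
end
```

With `popped: false` all seven instructions run and the value is returned; with `popped: true` execution stops at the fourth instruction and the result is nil.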

Determining where to add the new instruction

The above example shows a typical situation where opt_bailout is useful. We want to do this optimisation wherever possible. In order to do so, we have to define what is safe to optimise and what is not.

After a few moments of thinking about this topic, we concluded that this specific question is identical to the one addressed in #1943, which was about skipping the entire execution of a method if the method is entirely pure.

The details (definition etc.) of the "purity" we are talking about can be found in the previous proposal.

In this proposal, on the other hand, we do not restrict the target methods to those which are entirely pure. We also do not restrict the skip target to be the method as a whole. Instead, we scan the method's instruction sequence backwards from its end to find the point where the last non-pure instruction appears. That point is where opt_bailout can reside. All instructions after it can be skipped when the return value is not used.
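A minimal sketch of that backward scan, assuming a hypothetical purity table (`PURE_INSNS`); the actual classification of instructions is the one defined in #1943, and the real compiler works on instruction objects rather than symbols.

```ruby
# Hypothetical purity table, for illustration only.
PURE_INSNS = %i[getlocal setlocal putobject putself dup pop leave].freeze

# Returns the index at which a bailout could be inserted: just after
# the last impure instruction, or 0 when every instruction is pure.
def bailout_index(insns)
  last_impure = insns.rindex { |insn| !PURE_INSNS.include?(insn) }
  last_impure ? last_impure + 1 : 0
end

# The `foo` example: send is the last impure instruction.
FOO_INSNS = %i[putself getlocal send setlocal getlocal leave].freeze
```

For `FOO_INSNS` the result is 3, that is, between send and setlocal, which matches the position of opt_bailout in the disassembly of foo.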

C API for the flag

So far we have modified how methods written in Ruby are executed. The execution of methods written in C is not modified -- and cannot be modified automatically as we did above. However, the VM_FRAME_FLAG_POPPED flag is set anyway, even when the method being called is written in C. If the C source code can be modified, exposing this flag to C code opens up extra room for optimisation.

Let us provide an API to see the flag, so that future C methods can benefit.
diff --git a/include/ruby/intern.h b/include/ruby/intern.h
index 17aafd7f8e..f13c2bb941 100644
--- a/include/ruby/intern.h
+++ b/include/ruby/intern.h
@@ -992,6 +992,8 @@ VALUE rb_time_succ(VALUE);
 VALUE rb_make_backtrace(void);
 VALUE rb_make_exception(int, const VALUE*);

+int rb_whether_the_return_value_is_used_p(void);
+
 RUBY_SYMBOL_EXPORT_END

 #if defined(__cplusplus)
diff --git a/vm.c b/vm.c
index c5beed64c0..d33ff98619 100644
--- a/vm.c
+++ b/vm.c
@@ -3544,4 +3544,14 @@ vm_collect_usage_register(int reg, int isset)

 #endif /* #ifndef MJIT_HEADER */

+int
+rb_whether_the_return_value_is_used_p(void)
+{
+    const struct rb_execution_context_struct *ec = GET_EC();
+    const struct rb_control_frame_struct *reg_cfp = ec->cfp;
+    const VALUE *ep = GET_EP();
+
+    return ! VM_ENV_FLAGS(ep, VM_FRAME_FLAG_POPPED);
+}
+
 #include "vm_call_iseq_optimized.inc" /* required from vm_insnhelper.c */

One method that can benefit is StringScanner#scan. It returns a String and, at the same time, modifies its receiver's internal state. The return value is not always needed, and allocating such a wasted String can be avoided by looking at the flag.
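The idea can be sketched in Ruby with a toy scanner. Everything here is hypothetical: `ToyScanner` is not the real StringScanner, and the `want_value:` keyword merely stands in for what rb_whether_the_return_value_is_used_p() would tell a C implementation.

```ruby
# Toy scanner, for illustration only.
class ToyScanner
  attr_reader :pos, :allocations

  def initialize(str)
    @str = str
    @pos = 0
    @allocations = 0 # counts result strings we had to build
  end

  # Advances past `prefix` if it occurs at the current position.
  # The result string is built only when the caller wants it.
  def scan(prefix, want_value: true)
    return nil unless @str.index(prefix, @pos) == @pos

    @pos += prefix.length # the side effect always happens
    return nil unless want_value

    @allocations += 1
    prefix.dup
  end
end
```

Calling `scan("ab", want_value: false)` still advances the position but leaves `allocations` untouched, which is the saving the flag makes possible.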

Deleting the pop instruction

We have made arbitrary return values possible when they are not used by the caller. The value, however, is still pushed onto the top of the stack and then popped immediately. Let us optimise this part.

That said, the optimisation might not always be possible. The pop instruction immediately following a send can also be a jump destination. In such a situation, although the method can return arbitrary values, the pop cannot be eliminated.
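A minimal sketch of that check, using a toy encoding in which each instruction is `[opcode, operand]` and branch operands are target indices. `removable_pops` and `BRANCH_INSNS` are hypothetical names; the real compiler tracks labels rather than raw indices.

```ruby
BRANCH_INSNS = %i[jump branchif branchunless].freeze

# Returns the indices of pop instructions that directly follow a send
# and are not the destination of any branch.
def removable_pops(insns)
  targets = insns.select { |op, _| BRANCH_INSNS.include?(op) }
                 .map { |_, target| target }
  insns.each_index.select do |i|
    i > 0 &&
      insns[i][0] == :pop &&
      insns[i - 1][0] == :send &&
      !targets.include?(i)
  end
end

# The pop at index 4 is a branch target, so only index 1 qualifies.
TOY_BRANCHY = [
  [:send], [:pop], [:branchunless, 4], [:send], [:pop], [:leave]
].freeze
```

For `TOY_BRANCHY` the result is `[1]`: the second pop must stay because control can reach it without going through the preceding send.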

We want to rule out such cases from optimisations. Let us introduce another VM stack frame flag to distinguish the two:
diff --git a/vm_core.h b/vm_core.h
index 513b8b85c1..5d2500d187 100644
--- a/vm_core.h
+++ b/vm_core.h
@@ -1133,11 +1135,11 @@ typedef rb_control_frame_t *

 enum {
     /* Frame/Environment flag bits:
-     *   MMMM MMMM MMMM MMMM ____ _FFF FFFF EEEX (LSB)
+     *   MMMM MMMM MMMM MMMM ____ FFFF FFFF EEEX (LSB)
      *
      * X   : tag for GC marking (It seems as Fixnum)
      * EEE : 3 bits Env flags
-     * FF..: 7 bits Frame flags
+     * FF..: 8 bits Frame flags
      * MM..: 15 bits frame magic (to check frame corruption)
      */

@@ -1162,6 +1164,7 @@ enum {
     VM_FRAME_FLAG_LAMBDA    = 0x0100,
     VM_FRAME_FLAG_MODIFIED_BLOCK_PARAM = 0x0200,
     VM_FRAME_FLAG_POPPED    = 0x0400,
+    VM_FRAME_FLAG_POPIT     = 0x0800,

     /* env flag */
     VM_ENV_FLAG_LOCAL       = 0x0002,
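The widening of the frame flag field in the comments (6, then 7, then 8 bits) follows from simple bit arithmetic, sketched below. `frame_flag_mask` and `fits?` are hypothetical helpers; the layout assumed is the one drawn in the comment, with the four low bits taken by "EEEX".

```ruby
ENV_BITS = 4 # "EEEX": three env flag bits plus the GC tag bit

# Mask covering a frame flag field of `width` bits above the env bits.
def frame_flag_mask(width)
  ((1 << width) - 1) << ENV_BITS
end

# True when `flag` lies entirely inside a field of `width` bits.
def fits?(flag, width)
  (flag & frame_flag_mask(width)) == flag
end

VM_FRAME_FLAG_POPPED = 0x0400
VM_FRAME_FLAG_POPIT  = 0x0800
```

A 6-bit field has mask 0x03F0, so 0x0400 does not fit and needs 7 bits (mask 0x07F0); 0x0800 in turn needs 8 bits (mask 0x0FF0).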

With this flag set, it is a callee's duty, not a caller's, to properly avoid generating return values. How? There are (surprisingly) three patterns of returning values from a method.

  1. Methods written in C: they return their return values using C's return semantics. Simply skipping the push of the value onto the VM's stack should suffice. This is the simplest of the three situations.

  2. Methods written in Ruby without any block invocations: their return values are pushed onto the stack in their leave instructions. The instruction has to be modified to check the flag.

  3. Methods can also return from the inside of a block.

    This is complicated. For instance:
    def foo
      1.times do |x|
        return x
      end
    end

    This return returns from the foo method. On the other hand,

    def foo
      1.times &-> (x) do
        1.times do |y|
          return [x, y]
        end
      end
    end

    This return returns from the lambda, not from the entire method. So a return inside a block has to unwind the stack dynamically to find the place where the VM should continue its execution. This was possible before because what to do after a return statement was always the same (push the return value). Now that we are going to eliminate the pop instruction, some sort of continuation has to be either preserved or reconstructed; we chose the latter. We changed the stack unwinding routine so that it collects enough information to reconstruct the continuation.
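The difference between the two kinds of return can be observed from plain Ruby. The sketch below restates the two examples with hypothetical method names and a slightly different lambda syntax, so that the results can be inspected; it shows language behaviour only, not the modified unwinding routine.

```ruby
# `return` inside a plain block leaves the enclosing method.
def return_from_block
  1.times { |x| return x }
  :not_reached
end

# `return` inside a block nested in a lambda leaves only the lambda;
# the method keeps running after 1.times finishes.
def return_from_lambda
  inner = ->(x) { 1.times { |y| return [x, y] } }
  result = 1.times(&inner) # Integer#times returns its receiver
  [:method_continued, result]
end
```

`return_from_block` yields 0, the block argument, while `return_from_lambda` yields `[:method_continued, 1]`: the `[x, y]` value is handed back to `1.times`, which discards it.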

Experiments

The proposed changeset applies to trunk r67168. We compare benchmark results before and after applying 312c580. All benchmarks were run on a Linux VM hosted on Windows 10, on a ThinkPad laptop.

Results of make benchmark

This set of benchmarks is classified as microbenchmarks; it consists of many small Ruby snippets, each of which tends to show the speed of one specific area of Ruby. In most instances, the results differ by a negligible amount. There are a few exceptional benchmarks in which the proposed changeset clearly outperforms trunk. This pattern roughly resembles that of #1943.
Warming up --------------------------------------
                                     (1..1_000_000).last(100)     1.026M i/s -      1.108M times in 1.079327s (974.27ns/i)
                                    (1..1_000_000).last(1000)   104.974k i/s -    110.627k times in 1.053850s (9.53μs/i)
                                   (1..1_000_000).last(10000)    10.689k i/s -     11.220k times in 1.049692s (93.56μs/i)
Time.strptime("28/Aug/2005:06:54:20 +0000", "%d/%b/%Y:%T %z")   155.840k i/s -    163.295k times in 1.047840s (6.42μs/i)
                                     Time.strptime("1", "%s")     1.580M i/s -      1.630M times in 1.031937s (633.01ns/i)
                            Time.strptime("0 +0100", "%s %z")   223.696k i/s -    226.787k times in 1.013819s (4.47μs/i)
                              Time.strptime("0 UTC", "%s %z")   487.990k i/s -    515.160k times in 1.055678s (2.05μs/i)
                                Time.strptime("1.5", "%s.%N")     1.137M i/s -      1.140M times in 1.003196s (879.83ns/i)
                     Time.strptime("1.000000000001", "%s.%N")   650.423k i/s -    691.536k times in 1.063210s (1.54μs/i)
                 Time.strptime("20010203 -0200", "%Y%m%d %z")   150.868k i/s -    159.368k times in 1.056341s (6.63μs/i)
                   Time.strptime("20010203 UTC", "%Y%m%d %z")   219.247k i/s -    232.760k times in 1.061635s (4.56μs/i)
                           Time.strptime("2018-365", "%Y-%j")   130.456k i/s -    138.468k times in 1.061418s (7.67μs/i)
                           Time.strptime("2018-091", "%Y-%j")   135.548k i/s -    135.575k times in 1.000199s (7.38μs/i)
Calculating -------------------------------------
                                                                   trunk        ours
                                                   app_answer     53.671      51.241 i/s -       1.000 times in 0.018632s 0.019515s
                                                  app_aobench      0.026       0.027 i/s -       1.000 times in 37.993009s 37.296158s
                                                app_factorial      1.591       1.887 i/s -       1.000 times in 0.628369s 0.530042s
                                                      app_fib      3.343       3.310 i/s -       1.000 times in 0.299150s 0.302090s
                                              app_lc_fizzbuzz      0.047       0.048 i/s -       1.000 times in 21.227309s 20.751124s
                                               app_mandelbrot      1.887       1.868 i/s -       1.000 times in 0.529970s 0.535473s
                                                app_pentomino      0.102       0.103 i/s -       1.000 times in 9.815870s 9.706780s
                                                    app_raise      7.903       7.921 i/s -       1.000 times in 0.126531s 0.126250s
                                                app_strconcat      2.685       2.617 i/s -       1.000 times in 0.372384s 0.382045s
                                                      app_tak      2.372       2.165 i/s -       1.000 times in 0.421534s 0.461919s
                                                    app_tarai      3.056       2.687 i/s -       1.000 times in 0.327201s 0.372188s
                                                      app_uri      2.614       2.668 i/s -       1.000 times in 0.382493s 0.374826s
                                         array_sample_100k_10    154.116     157.748 i/s -       1.000 times in 0.006489s 0.006339s
                                         array_sample_100k_11    105.786      96.265 i/s -       1.000 times in 0.009453s 0.010388s
                                        array_sample_100k__1k      2.291       2.275 i/s -       1.000 times in 0.436564s 0.439523s
                                        array_sample_100k__6k      0.621       0.587 i/s -       1.000 times in 1.609796s 1.702779s
                                       array_sample_100k__100     20.391      19.665 i/s -       1.000 times in 0.049040s 0.050852s
                                      array_sample_100k___10k      0.453       0.430 i/s -       1.000 times in 2.209507s 2.324998s
                                      array_sample_100k___50k      0.124       0.118 i/s -       1.000 times in 8.087241s 8.473559s
                                                  array_shift      0.423       0.500 i/s -       1.000 times in 2.366338s 1.998473s
                                              array_small_and     90.437      82.361 i/s -       1.000 times in 0.011057s 0.012142s
                                             array_small_diff     85.459      82.879 i/s -       1.000 times in 0.011702s 0.012066s
                                               array_small_or     58.742      55.923 i/s -       1.000 times in 0.017024s 0.017882s
                                             array_sort_block      0.200       0.209 i/s -       1.000 times in 5.002746s 4.779594s
                                             array_sort_float      0.682       0.758 i/s -       1.000 times in 1.466914s 1.318983s
                                          array_values_at_int    132.476     140.026 i/s -       1.000 times in 0.007549s 0.007142s
                                        array_values_at_range      4.714       4.949 i/s -       1.000 times in 0.212129s 0.202072s
                                                      bighash      0.730       0.739 i/s -       1.000 times in 1.369398s 1.353203s
                                                  dir_empty_p      3.246       3.262 i/s -       1.000 times in 0.308045s 0.306577s
                                          enum_lazy_grep_v_20      7.047       7.071 i/s -       1.000 times in 0.141910s 0.141421s
                                          enum_lazy_grep_v_50      5.245       5.294 i/s -       1.000 times in 0.190656s 0.188902s
                                         enum_lazy_grep_v_100      3.821       3.938 i/s -       1.000 times in 0.261735s 0.253933s
                                            enum_lazy_uniq_20      6.163       6.197 i/s -       1.000 times in 0.162250s 0.161365s
                                            enum_lazy_uniq_50      4.487       4.492 i/s -       1.000 times in 0.222883s 0.222615s
                                           enum_lazy_uniq_100      3.043       3.075 i/s -       1.000 times in 0.328673s 0.325191s
                                                  fiber_chain      0.907       0.940 i/s -       1.000 times in 1.102930s 1.063456s
                                                   file_chmod      3.286       3.340 i/s -       1.000 times in 0.304306s 0.299428s
                                                  file_rename      0.402       0.435 i/s -       1.000 times in 2.487555s 2.299141s
                                               hash_aref_dsym      4.014       4.179 i/s -       1.000 times in 0.249099s 0.239310s
                                          hash_aref_dsym_long      0.260       0.259 i/s -       1.000 times in 3.851448s 3.867149s
                                                hash_aref_fix      4.232       4.401 i/s -       1.000 times in 0.236297s 0.227206s
                                                hash_aref_flo     38.119      37.916 i/s -       1.000 times in 0.026234s 0.026374s
                                               hash_aref_miss      3.136       3.240 i/s -       1.000 times in 0.318839s 0.308622s
                                                hash_aref_str      3.519       3.537 i/s -       1.000 times in 0.284198s 0.282758s
                                                hash_aref_sym      3.946       4.144 i/s -       1.000 times in 0.253440s 0.241315s
                                           hash_aref_sym_long      2.753       2.901 i/s -       1.000 times in 0.363272s 0.344678s
                                                 hash_flatten      6.983       7.044 i/s -       1.000 times in 0.143209s 0.141970s
                                               hash_ident_flo     40.631      37.819 i/s -       1.000 times in 0.024612s 0.026442s
                                               hash_ident_num      4.302       4.452 i/s -       1.000 times in 0.232476s 0.224611s
                                               hash_ident_obj      4.262       4.378 i/s -       1.000 times in 0.234619s 0.228394s
                                               hash_ident_str      4.240       4.424 i/s -       1.000 times in 0.235826s 0.226019s
                                               hash_ident_sym      4.205       4.377 i/s -       1.000 times in 0.237822s 0.228445s
                                                    hash_keys     12.017      11.953 i/s -       1.000 times in 0.083215s 0.083660s
                                          hash_literal_small2      2.013       2.057 i/s -       1.000 times in 0.496785s 0.486182s
                                          hash_literal_small4      1.714       1.691 i/s -       1.000 times in 0.583331s 0.591259s
                                          hash_literal_small8      1.249       1.242 i/s -       1.000 times in 0.800458s 0.804839s
                                                    hash_long      2.042       2.023 i/s -       1.000 times in 0.489629s 0.494231s
                                                   hash_shift    127.349     126.173 i/s -       1.000 times in 0.007852s 0.007926s
                                               hash_shift_u16     20.627      20.992 i/s -       1.000 times in 0.048481s 0.047638s
                                               hash_shift_u24     20.434      21.774 i/s -       1.000 times in 0.048938s 0.045927s
                                               hash_shift_u32     19.649      20.663 i/s -       1.000 times in 0.050893s 0.048395s
                                                  hash_small2      1.950       1.973 i/s -       1.000 times in 0.512827s 0.506914s
                                                  hash_small4      1.525       1.560 i/s -       1.000 times in 0.655531s 0.640959s
                                                  hash_small8      1.072       1.113 i/s -       1.000 times in 0.932868s 0.898842s
                                                 hash_to_proc    399.695     406.288 i/s -       1.000 times in 0.002502s 0.002461s
                                                  hash_values     11.924      11.785 i/s -       1.000 times in 0.083865s 0.084854s
                                                      int_quo      1.289       1.325 i/s -       1.000 times in 0.776054s 0.754653s
                                         io_copy_stream_write      5.982       6.170 i/s -       1.000 times in 0.167168s 0.162065s
                                  io_copy_stream_write_socket      2.813       2.955 i/s -       1.000 times in 0.355464s 0.338435s
                                               io_file_create      0.917       0.952 i/s -       1.000 times in 1.090386s 1.050074s
                                                 io_file_read      1.105       1.101 i/s -       1.000 times in 0.904766s 0.907911s
                                                io_file_write      1.561       1.575 i/s -       1.000 times in 0.640675s 0.634840s
                                             io_nonblock_noex      0.615       0.620 i/s -       1.000 times in 1.626202s 1.612408s
                                            io_nonblock_noex2      0.704       0.699 i/s -       1.000 times in 1.421363s 1.430235s
                                                   io_pipe_rw      0.981       0.993 i/s -       1.000 times in 1.019529s 1.007359s
                                                    io_select      0.666       0.672 i/s -       1.000 times in 1.500611s 1.487103s
                                                   io_select2      0.576       0.570 i/s -       1.000 times in 1.736614s 1.753484s
                                                   io_select3     94.518      86.204 i/s -       1.000 times in 0.010580s 0.011600s
                                                     loop_for      1.160       1.184 i/s -       1.000 times in 0.861942s 0.844259s
                                               loop_generator      8.678       9.194 i/s -       1.000 times in 0.115229s 0.108771s
                                                   loop_times      1.266       1.319 i/s -       1.000 times in 0.789672s 0.757930s
                                               loop_whileloop      2.870       2.731 i/s -       1.000 times in 0.348402s 0.366221s
                                              loop_whileloop2     14.229      13.640 i/s -       1.000 times in 0.070279s 0.073316s
                                             marshal_dump_flo      5.491       5.708 i/s -       1.000 times in 0.182103s 0.175208s
                                      marshal_dump_load_geniv      3.431       3.474 i/s -       1.000 times in 0.291450s 0.287885s
                                       marshal_dump_load_time      1.542       1.590 i/s -       1.000 times in 0.648366s 0.629070s
                                                 securerandom      6.333       6.447 i/s -       1.000 times in 0.157897s 0.155111s
                                                 so_ackermann      3.371       3.167 i/s -       1.000 times in 0.296675s 0.315793s
                                                     so_array      1.706       1.787 i/s -       1.000 times in 0.586306s 0.559484s
                                              so_binary_trees      0.249       0.251 i/s -       1.000 times in 4.014320s 3.984743s
                                               so_concatenate      0.389       0.383 i/s -       1.000 times in 2.571954s 2.610059s
                                                 so_exception      5.742       5.993 i/s -       1.000 times in 0.174167s 0.166850s
                                                  so_fannkuch      1.902       1.879 i/s -       1.000 times in 0.525716s 0.532184s
                                                     so_fasta      0.727       0.752 i/s -       1.000 times in 1.375305s 1.329016s
                                                     so_lists      2.920       3.007 i/s -       1.000 times in 0.342458s 0.332517s
                                                so_mandelbrot      0.533       0.533 i/s -       1.000 times in 1.874596s 1.875165s
                                                    so_matrix      2.293       2.623 i/s -       1.000 times in 0.436118s 0.381195s
                                            so_meteor_contest      0.469       0.480 i/s -       1.000 times in 2.131592s 2.084690s
                                                     so_nbody      0.909       0.939 i/s -       1.000 times in 1.100367s 1.064415s
                                               so_nested_loop      1.390       1.491 i/s -       1.000 times in 0.719196s 0.670909s
                                                    so_nsieve      0.891       0.931 i/s -       1.000 times in 1.121810s 1.074669s
                                               so_nsieve_bits      0.653       0.646 i/s -       1.000 times in 1.530978s 1.548282s
                                                    so_object      1.998       2.153 i/s -       1.000 times in 0.500585s 0.464415s
                                              so_partial_sums      0.684       0.669 i/s -       1.000 times in 1.461657s 1.495530s
                                                  so_pidigits      1.463       1.461 i/s -       1.000 times in 0.683522s 0.684659s
                                                    so_random      2.496       2.433 i/s -       1.000 times in 0.400716s 0.411032s
                                                     so_sieve      3.009       2.978 i/s -       1.000 times in 0.332282s 0.335750s
                                              so_spectralnorm      0.814       0.821 i/s -       1.000 times in 1.227841s 1.217327s
                                                 string_index      3.556       3.602 i/s -       1.000 times in 0.281188s 0.277659s
                                               string_scan_re      6.795       6.781 i/s -       1.000 times in 0.147165s 0.147471s
                                              string_scan_str      9.637      10.106 i/s -       1.000 times in 0.103771s 0.098950s
                                                  time_subsec      1.236       1.238 i/s -       1.000 times in 0.808866s 0.807594s
                                                vm3_backtrace     10.883      10.731 i/s -       1.000 times in 0.091885s 0.093188s
                                         vm3_clearmethodcache      5.779       5.889 i/s -       1.000 times in 0.173052s 0.169821s
                                                       vm3_gc      0.946       0.963 i/s -       1.000 times in 1.057203s 1.038752s
                                              vm3_gc_old_full      0.450       0.446 i/s -       1.000 times in 2.224673s 2.240602s
                                         vm3_gc_old_immediate      0.505       0.507 i/s -       1.000 times in 1.981630s 1.972916s
                                              vm3_gc_old_lazy      0.390       0.395 i/s -       1.000 times in 2.561190s 2.534075s
                                         vm_symbol_block_pass      1.525       1.543 i/s -       1.000 times in 0.655621s 0.647947s
                                       vm_thread_alive_check1      4.256       4.334 i/s -       1.000 times in 0.234975s 0.230747s
                                              vm_thread_close      1.562       1.560 i/s -       1.000 times in 0.640172s 0.641012s
                                           vm_thread_condvar1      0.164       0.164 i/s -       1.000 times in 6.079991s 6.093860s
                                           vm_thread_condvar2      0.169       0.171 i/s -       1.000 times in 5.910249s 5.842378s
                                        vm_thread_create_join      0.147       0.148 i/s -       1.000 times in 6.785046s 6.739855s
                                             vm_thread_mutex1      2.943       3.048 i/s -       1.000 times in 0.339827s 0.328114s
                                             vm_thread_mutex2      2.925       3.030 i/s -       1.000 times in 0.341903s 0.330072s
                                             vm_thread_mutex3      1.471       1.508 i/s -       1.000 times in 0.679789s 0.662923s
/tmp/benchmark_driver-20190318-15893-uk87t1.rb:26:in `initialize': can't create Thread: Resource temporarily unavailable (ThreadError)
        from /tmp/benchmark_driver-20190318-15893-uk87t1.rb:26:in `new'
        from /tmp/benchmark_driver-20190318-15893-uk87t1.rb:26:in `block in <main>'
        from /tmp/benchmark_driver-20190318-15893-uk87t1.rb:25:in `each'
        from /tmp/benchmark_driver-20190318-15893-uk87t1.rb:25:in `map'
        from /tmp/benchmark_driver-20190318-15893-uk87t1.rb:25:in `<main>'
/tmp/benchmark_driver-20190318-15893-qv885y.rb:26:in `initialize': can't create Thread: Resource temporarily unavailable (ThreadError)
        from /tmp/benchmark_driver-20190318-15893-qv885y.rb:26:in `new'
        from /tmp/benchmark_driver-20190318-15893-qv885y.rb:26:in `block in <main>'
        from /tmp/benchmark_driver-20190318-15893-qv885y.rb:25:in `each'
        from /tmp/benchmark_driver-20190318-15893-qv885y.rb:25:in `map'
        from /tmp/benchmark_driver-20190318-15893-qv885y.rb:25:in `<main>'
                                               vm_thread_pass      1.658       1.682 i/s -       1.000 times in 0.603041s 0.594380s
                                         vm_thread_pass_flood     12.167      11.993 i/s -       1.000 times in 0.082190s 0.083385s
                                               vm_thread_pipe      4.814       5.770 i/s -       1.000 times in 0.207707s 0.173305s
                                              vm_thread_queue     13.202      14.092 i/s -       1.000 times in 0.075747s 0.070962s
                                        vm_thread_sized_queue      1.712       1.695 i/s -       1.000 times in 0.584052s 0.590099s
                                       vm_thread_sized_queue2      0.204       0.192 i/s -       1.000 times in 4.897092s 5.213936s
                                       vm_thread_sized_queue3      0.206       0.199 i/s -       1.000 times in 4.848554s 5.021454s
                                       vm_thread_sized_queue4      0.476       0.476 i/s -       1.000 times in 2.099935s 2.101791s
                                                      app_erb    18.571k     18.621k i/s -     15.000k times in 0.807703s 0.805526s
                                            complex_float_add    18.304M     19.850M i/s -      1.000M times in 0.054632s 0.050378s
                                            complex_float_div   804.935k    815.930k i/s -      1.000M times in 1.242337s 1.225596s
                                            complex_float_mul     7.585M      7.915M i/s -      1.000M times in 0.131848s 0.126346s
                                            complex_float_new     2.267M      2.437M i/s -      1.000M times in 0.441025s 0.410417s
                                          complex_float_power     2.702M      2.878M i/s -      1.000M times in 0.370076s 0.347506s
                                            complex_float_sub    12.500M     11.923M i/s -      1.000M times in 0.080002s 0.083873s
                                                   erb_render     1.687M      1.640M i/s -      1.500M times in 0.889161s 0.914826s
                                     (1..1_000_000).last(100)     1.039M      1.128M i/s -      3.079M times in 2.962974s 2.728951s
                                    (1..1_000_000).last(1000)   105.793k    114.397k i/s -    314.922k times in 2.976780s 2.752886s
                                   (1..1_000_000).last(10000)    10.614k     11.328k i/s -     32.066k times in 3.021134s 2.830698s
                                                      require      1.543       1.531 i/s -       1.000 times in 0.647928s 0.653296s
                                               require_thread      0.039       0.040 i/s -       1.000 times in 25.707235s 25.143825s
                                               so_count_words     10.267       8.572 i/s -       1.000 times in 0.097395s 0.116656s
preparing /tmp/fasta.output.100000
                                              so_k_nucleotide      1.230       1.229 i/s -       1.000 times in 0.812981s 0.813680s
preparing /tmp/fasta.output.2500000
                                        so_reverse_complement      0.994       0.968 i/s -       1.000 times in 1.006068s 1.033340s
Time.strptime("28/Aug/2005:06:54:20 +0000", "%d/%b/%Y:%T %z")   158.145k    153.978k i/s -    467.518k times in 2.956263s 3.036261s
                                     Time.strptime("1", "%s")     1.623M      1.580M i/s -      4.739M times in 2.919664s 3.000454s
                            Time.strptime("0 +0100", "%s %z")   229.866k    226.532k i/s -    671.087k times in 2.919471s 2.962433s
                              Time.strptime("0 UTC", "%s %z")   506.340k    499.099k i/s -      1.464M times in 2.891278s 2.933224s
                                Time.strptime("1.5", "%s.%N")     1.153M      1.148M i/s -      3.410M times in 2.956365s 2.971241s
                     Time.strptime("1.000000000001", "%s.%N")   653.277k    677.285k i/s -      1.951M times in 2.986892s 2.881013s
                 Time.strptime("20010203 -0200", "%Y%m%d %z")   152.038k    150.570k i/s -    452.603k times in 2.976903s 3.005921s
                   Time.strptime("20010203 UTC", "%Y%m%d %z")   223.095k    215.194k i/s -    657.740k times in 2.948245s 3.056495s
                           Time.strptime("2018-365", "%Y-%j")   138.117k    131.867k i/s -    391.366k times in 2.833576s 2.967889s
                           Time.strptime("2018-091", "%Y-%j")   132.252k    130.186k i/s -    406.644k times in 3.074767s 3.123567s
                                                vm1_attr_ivar    66.512M     66.547M i/s -     30.000M times in 0.451048s 0.450809s
                                            vm1_attr_ivar_set    49.351M     69.413M i/s -     30.000M times in 0.607891s 0.432194s
                                                    vm1_block    32.752M     31.093M i/s -     30.000M times in 0.915972s 0.964842s
                                               vm1_blockparam    36.197M     37.477M i/s -     30.000M times in 0.828792s 0.800489s
                                          vm1_blockparam_call    21.445M     20.791M i/s -     30.000M times in 1.398954s 1.442956s
                                          vm1_blockparam_pass    16.726M     16.002M i/s -     30.000M times in 1.793623s 1.874816s
                                         vm1_blockparam_yield    23.039M     21.739M i/s -     30.000M times in 1.302155s 1.380011s
                                                    vm1_const   206.190M    190.329M i/s -     30.000M times in 0.145497s 0.157622s
                                                   vm1_ensure      2.853       2.717 i/s -       1.000 times in 0.350459s 0.367995s
                                             vm1_float_simple    16.170M     13.889M i/s -     30.000M times in 1.855232s 2.159986s
                                           vm1_gc_short_lived     7.559M      7.552M i/s -     30.000M times in 3.968665s 3.972608s
                               vm1_gc_short_with_complex_long     9.750M      9.755M i/s -     30.000M times in 3.076927s 3.075316s
                                       vm1_gc_short_with_long     6.898M      7.653M i/s -     30.000M times in 4.349361s 3.920170s
                                     vm1_gc_short_with_symbol     8.470M      8.513M i/s -     30.000M times in 3.542060s 3.523985s
                                                vm1_gc_wb_ary    75.275M     87.272M i/s -     30.000M times in 0.398536s 0.343754s
                                       vm1_gc_wb_ary_promoted    74.603M     92.753M i/s -     30.000M times in 0.402127s 0.323441s
                                                vm1_gc_wb_obj    93.787M    124.324M i/s -     30.000M times in 0.319873s 0.241304s
                                       vm1_gc_wb_obj_promoted    94.269M    124.374M i/s -     30.000M times in 0.318237s 0.241208s
                                                     vm1_ivar   168.669M    196.132M i/s -     30.000M times in 0.177864s 0.152958s
                                                 vm1_ivar_set   113.435M    146.165M i/s -     30.000M times in 0.264468s 0.205247s
                                                   vm1_length   119.473M    141.665M i/s -     30.000M times in 0.251103s 0.211767s
                                                vm1_lvar_init      0.910       0.940 i/s -       1.000 times in 1.098793s 1.063384s
                                                 vm1_lvar_set    18.572M     18.433M i/s -     30.000M times in 1.615367s 1.627483s
                                                      vm1_neq    92.776M    101.414M i/s -     30.000M times in 0.323360s 0.295818s
                                                      vm1_not   253.468M    247.023M i/s -     30.000M times in 0.118358s 0.121446s
                                                   vm1_rescue   412.776M    419.803M i/s -     30.000M times in 0.072679s 0.071462s
                                             vm1_simplereturn    85.207M     93.571M i/s -     30.000M times in 0.352082s 0.320611s
                                                     vm1_swap   189.659M    173.338M i/s -     30.000M times in 0.158179s 0.173072s
                                                    vm1_yield      1.218       1.105 i/s -       1.000 times in 0.820754s 0.904794s
                                                    vm2_array    44.336M     45.667M i/s -      6.000M times in 0.135329s 0.131387s
                                                 vm2_bigarray    44.565M     45.682M i/s -      6.000M times in 0.134636s 0.131343s
                                                  vm2_bighash   615.331k    601.908k i/s -     60.000k times in 0.097509s 0.099683s
                                                     vm2_case    95.501M     98.707M i/s -      6.000M times in 0.062827s 0.060786s
                                                 vm2_case_lit      2.399       2.236 i/s -       1.000 times in 0.416921s 0.447224s
                                           vm2_defined_method     2.777M      3.000M i/s -      6.000M times in 2.160968s 1.999920s
                                                     vm2_dstr     7.101M      7.134M i/s -      6.000M times in 0.844973s 0.840990s
                                                     vm2_eval   414.478k    386.920k i/s -      6.000M times in 14.476036s 15.507099s
                                             vm2_fiber_switch    10.763M     11.145M i/s -      6.000M times in 0.557446s 0.538334s
                                             vm2_freezestring    11.091M     11.677M i/s -      6.000M times in 0.540962s 0.513827s
                                                   vm2_method    10.008M     15.199M i/s -      6.000M times in 0.599513s 0.394756s
                                           vm2_method_missing     3.563M      3.635M i/s -      6.000M times in 1.683852s 1.650804s
                                        vm2_method_with_block     7.760M     11.854M i/s -      6.000M times in 0.773177s 0.506156s
                                     vm2_module_ann_const_set     1.807M      1.811M i/s -      6.000M times in 3.320942s 3.313823s
                                         vm2_module_const_set     1.790M      1.801M i/s -      6.000M times in 3.351949s 3.332103s
                                                    vm2_mutex    15.938M     16.092M i/s -      6.000M times in 0.376468s 0.372858s
                                                vm2_newlambda    12.942M     13.685M i/s -      6.000M times in 0.463597s 0.438439s
                                              vm2_poly_method      0.597       0.511 i/s -       1.000 times in 1.676129s 1.957726s
                                           vm2_poly_method_ov      5.011       4.892 i/s -       1.000 times in 0.199565s 0.204428s
                                           vm2_poly_singleton      1.185       1.147 i/s -       1.000 times in 0.843718s 0.871925s
                                                     vm2_proc    43.869M     49.184M i/s -      6.000M times in 0.136771s 0.121990s
                                                   vm2_raise1     2.113M      2.131M i/s -      6.000M times in 2.838982s 2.815877s
                                                   vm2_raise2     1.257M      1.270M i/s -      6.000M times in 4.771427s 4.725603s
                                                   vm2_regexp     7.888M      7.793M i/s -      6.000M times in 0.760649s 0.769902s
                                                     vm2_send    27.575M     34.132M i/s -      6.000M times in 0.217584s 0.175789s
                                           vm2_string_literal    49.072M     50.477M i/s -      6.000M times in 0.122270s 0.118867s
                                       vm2_struct_big_aref_hi    54.829M     58.129M i/s -      6.000M times in 0.109432s 0.103218s
                                       vm2_struct_big_aref_lo    53.268M     57.067M i/s -      6.000M times in 0.112637s 0.105139s
                                          vm2_struct_big_aset      4.942       5.116 i/s -       1.000 times in 0.202360s 0.195475s
                                       vm2_struct_big_href_hi    30.351M     37.221M i/s -      6.000M times in 0.197684s 0.161201s
                                       vm2_struct_big_href_lo    32.352M     36.004M i/s -      6.000M times in 0.185460s 0.166647s
                                          vm2_struct_big_hset      3.532       3.921 i/s -       1.000 times in 0.283164s 0.255015s
                                        vm2_struct_small_aref    71.913M    100.203M i/s -      6.000M times in 0.083434s 0.059878s
                                        vm2_struct_small_aset      5.116       5.117 i/s -       1.000 times in 0.195449s 0.195431s
                                        vm2_struct_small_href    35.920M     41.460M i/s -      6.000M times in 0.167040s 0.144719s
                                        vm2_struct_small_hset    32.747M     41.035M i/s -      6.000M times in 0.183223s 0.146217s
                                                    vm2_super    23.172M     20.094M i/s -      6.000M times in 0.258937s 0.298600s
                                                    vm2_unif1    73.687M     77.696M i/s -      6.000M times in 0.081425s 0.077224s
                                                   vm2_zsuper    20.036M     20.982M i/s -      6.000M times in 0.299462s 0.285965s

Comparison:
                                                                app_answer
                                                        trunk:        53.7 i/s
                                                         ours:        51.2 i/s - 1.05x  slower

                                                               app_aobench
                                                         ours:         0.0 i/s
                                                        trunk:         0.0 i/s - 1.02x  slower

                                                             app_factorial
                                                         ours:         1.9 i/s
                                                        trunk:         1.6 i/s - 1.19x  slower

                                                                   app_fib
                                                        trunk:         3.3 i/s
                                                         ours:         3.3 i/s - 1.01x  slower

                                                           app_lc_fizzbuzz
                                                         ours:         0.0 i/s
                                                        trunk:         0.0 i/s - 1.02x  slower

                                                            app_mandelbrot
                                                        trunk:         1.9 i/s
                                                         ours:         1.9 i/s - 1.01x  slower

                                                             app_pentomino
                                                         ours:         0.1 i/s
                                                        trunk:         0.1 i/s - 1.01x  slower

                                                                 app_raise
                                                         ours:         7.9 i/s
                                                        trunk:         7.9 i/s - 1.00x  slower

                                                             app_strconcat
                                                        trunk:         2.7 i/s
                                                         ours:         2.6 i/s - 1.03x  slower

                                                                   app_tak
                                                        trunk:         2.4 i/s
                                                         ours:         2.2 i/s - 1.10x  slower

                                                                 app_tarai
                                                        trunk:         3.1 i/s
                                                         ours:         2.7 i/s - 1.14x  slower

                                                                   app_uri
                                                         ours:         2.7 i/s
                                                        trunk:         2.6 i/s - 1.02x  slower

                                                      array_sample_100k_10
                                                         ours:       157.7 i/s
                                                        trunk:       154.1 i/s - 1.02x  slower

                                                      array_sample_100k_11
                                                        trunk:       105.8 i/s
                                                         ours:        96.3 i/s - 1.10x  slower

                                                     array_sample_100k__1k
                                                        trunk:         2.3 i/s
                                                         ours:         2.3 i/s - 1.01x  slower

                                                     array_sample_100k__6k
                                                        trunk:         0.6 i/s
                                                         ours:         0.6 i/s - 1.06x  slower

                                                    array_sample_100k__100
                                                        trunk:        20.4 i/s
                                                         ours:        19.7 i/s - 1.04x  slower

                                                   array_sample_100k___10k
                                                        trunk:         0.5 i/s
                                                         ours:         0.4 i/s - 1.05x  slower

                                                   array_sample_100k___50k
                                                        trunk:         0.1 i/s
                                                         ours:         0.1 i/s - 1.05x  slower

                                                               array_shift
                                                         ours:         0.5 i/s
                                                        trunk:         0.4 i/s - 1.18x  slower

                                                           array_small_and
                                                        trunk:        90.4 i/s
                                                         ours:        82.4 i/s - 1.10x  slower

                                                          array_small_diff
                                                        trunk:        85.5 i/s
                                                         ours:        82.9 i/s - 1.03x  slower

                                                            array_small_or
                                                        trunk:        58.7 i/s
                                                         ours:        55.9 i/s - 1.05x  slower

                                                          array_sort_block
                                                         ours:         0.2 i/s
                                                        trunk:         0.2 i/s - 1.05x  slower

                                                          array_sort_float
                                                         ours:         0.8 i/s
                                                        trunk:         0.7 i/s - 1.11x  slower

                                                       array_values_at_int
                                                         ours:       140.0 i/s
                                                        trunk:       132.5 i/s - 1.06x  slower

                                                     array_values_at_range
                                                         ours:         4.9 i/s
                                                        trunk:         4.7 i/s - 1.05x  slower

                                                                   bighash
                                                         ours:         0.7 i/s
                                                        trunk:         0.7 i/s - 1.01x  slower

                                                               dir_empty_p
                                                         ours:         3.3 i/s
                                                        trunk:         3.2 i/s - 1.00x  slower

                                                       enum_lazy_grep_v_20
                                                         ours:         7.1 i/s
                                                        trunk:         7.0 i/s - 1.00x  slower

                                                       enum_lazy_grep_v_50
                                                         ours:         5.3 i/s
                                                        trunk:         5.2 i/s - 1.01x  slower

                                                      enum_lazy_grep_v_100
                                                         ours:         3.9 i/s
                                                        trunk:         3.8 i/s - 1.03x  slower

                                                         enum_lazy_uniq_20
                                                         ours:         6.2 i/s
                                                        trunk:         6.2 i/s - 1.01x  slower

                                                         enum_lazy_uniq_50
                                                         ours:         4.5 i/s
                                                        trunk:         4.5 i/s - 1.00x  slower

                                                        enum_lazy_uniq_100
                                                         ours:         3.1 i/s
                                                        trunk:         3.0 i/s - 1.01x  slower

                                                               fiber_chain
                                                         ours:         0.9 i/s
                                                        trunk:         0.9 i/s - 1.04x  slower

                                                                file_chmod
                                                         ours:         3.3 i/s
                                                        trunk:         3.3 i/s - 1.02x  slower

                                                               file_rename
                                                         ours:         0.4 i/s
                                                        trunk:         0.4 i/s - 1.08x  slower

                                                            hash_aref_dsym
                                                         ours:         4.2 i/s
                                                        trunk:         4.0 i/s - 1.04x  slower

                                                       hash_aref_dsym_long
                                                        trunk:         0.3 i/s
                                                         ours:         0.3 i/s - 1.00x  slower

                                                             hash_aref_fix
                                                         ours:         4.4 i/s
                                                        trunk:         4.2 i/s - 1.04x  slower

                                                             hash_aref_flo
                                                        trunk:        38.1 i/s
                                                         ours:        37.9 i/s - 1.01x  slower

                                                            hash_aref_miss
                                                         ours:         3.2 i/s
                                                        trunk:         3.1 i/s - 1.03x  slower

                                                             hash_aref_str
                                                         ours:         3.5 i/s
                                                        trunk:         3.5 i/s - 1.01x  slower

                                                             hash_aref_sym
                                                         ours:         4.1 i/s
                                                        trunk:         3.9 i/s - 1.05x  slower

                                                        hash_aref_sym_long
                                                         ours:         2.9 i/s
                                                        trunk:         2.8 i/s - 1.05x  slower

                                                              hash_flatten
                                                         ours:         7.0 i/s
                                                        trunk:         7.0 i/s - 1.01x  slower

                                                            hash_ident_flo
                                                        trunk:        40.6 i/s
                                                         ours:        37.8 i/s - 1.07x  slower

                                                            hash_ident_num
                                                         ours:         4.5 i/s
                                                        trunk:         4.3 i/s - 1.04x  slower

                                                            hash_ident_obj
                                                         ours:         4.4 i/s
                                                        trunk:         4.3 i/s - 1.03x  slower

                                                            hash_ident_str
                                                         ours:         4.4 i/s
                                                        trunk:         4.2 i/s - 1.04x  slower

                                                            hash_ident_sym
                                                         ours:         4.4 i/s
                                                        trunk:         4.2 i/s - 1.04x  slower

                                                                 hash_keys
                                                        trunk:        12.0 i/s
                                                         ours:        12.0 i/s - 1.01x  slower

                                                       hash_literal_small2
                                                         ours:         2.1 i/s
                                                        trunk:         2.0 i/s - 1.02x  slower

                                                       hash_literal_small4
                                                        trunk:         1.7 i/s
                                                         ours:         1.7 i/s - 1.01x  slower

                                                       hash_literal_small8
                                                        trunk:         1.2 i/s
                                                         ours:         1.2 i/s - 1.01x  slower

                                                                 hash_long
                                                        trunk:         2.0 i/s
                                                         ours:         2.0 i/s - 1.01x  slower

                                                                hash_shift
                                                        trunk:       127.3 i/s
                                                         ours:       126.2 i/s - 1.01x  slower

                                                            hash_shift_u16
                                                         ours:        21.0 i/s
                                                        trunk:        20.6 i/s - 1.02x  slower

                                                            hash_shift_u24
                                                         ours:        21.8 i/s
                                                        trunk:        20.4 i/s - 1.07x  slower

                                                            hash_shift_u32
                                                         ours:        20.7 i/s
                                                        trunk:        19.6 i/s - 1.05x  slower

                                                               hash_small2
                                                         ours:         2.0 i/s
                                                        trunk:         1.9 i/s - 1.01x  slower

                                                               hash_small4
                                                         ours:         1.6 i/s
                                                        trunk:         1.5 i/s - 1.02x  slower

                                                               hash_small8
                                                         ours:         1.1 i/s
                                                        trunk:         1.1 i/s - 1.04x  slower

                                                              hash_to_proc
                                                         ours:       406.3 i/s
                                                        trunk:       399.7 i/s - 1.02x  slower

                                                               hash_values
                                                        trunk:        11.9 i/s
                                                         ours:        11.8 i/s - 1.01x  slower

                                                                   int_quo
                                                         ours:         1.3 i/s
                                                        trunk:         1.3 i/s - 1.03x  slower

                                                      io_copy_stream_write
                                                         ours:         6.2 i/s
                                                        trunk:         6.0 i/s - 1.03x  slower

                                               io_copy_stream_write_socket
                                                         ours:         3.0 i/s
                                                        trunk:         2.8 i/s - 1.05x  slower

                                                            io_file_create
                                                         ours:         1.0 i/s
                                                        trunk:         0.9 i/s - 1.04x  slower

                                                              io_file_read
                                                        trunk:         1.1 i/s
                                                         ours:         1.1 i/s - 1.00x  slower

                                                             io_file_write
                                                         ours:         1.6 i/s
                                                        trunk:         1.6 i/s - 1.01x  slower

                                                          io_nonblock_noex
                                                         ours:         0.6 i/s
                                                        trunk:         0.6 i/s - 1.01x  slower

                                                         io_nonblock_noex2
                                                        trunk:         0.7 i/s
                                                         ours:         0.7 i/s - 1.01x  slower

                                                                io_pipe_rw
                                                         ours:         1.0 i/s
                                                        trunk:         1.0 i/s - 1.01x  slower

                                                                 io_select
                                                         ours:         0.7 i/s
                                                        trunk:         0.7 i/s - 1.01x  slower

                                                                io_select2
                                                        trunk:         0.6 i/s
                                                         ours:         0.6 i/s - 1.01x  slower

                                                                io_select3
                                                        trunk:        94.5 i/s
                                                         ours:        86.2 i/s - 1.10x  slower

                                                                  loop_for
                                                         ours:         1.2 i/s
                                                        trunk:         1.2 i/s - 1.02x  slower

                                                            loop_generator
                                                         ours:         9.2 i/s
                                                        trunk:         8.7 i/s - 1.06x  slower

                                                                loop_times
                                                         ours:         1.3 i/s
                                                        trunk:         1.3 i/s - 1.04x  slower

                                                            loop_whileloop
                                                        trunk:         2.9 i/s
                                                         ours:         2.7 i/s - 1.05x  slower

                                                           loop_whileloop2
                                                        trunk:        14.2 i/s
                                                         ours:        13.6 i/s - 1.04x  slower

                                                          marshal_dump_flo
                                                         ours:         5.7 i/s
                                                        trunk:         5.5 i/s - 1.04x  slower

                                                   marshal_dump_load_geniv
                                                         ours:         3.5 i/s
                                                        trunk:         3.4 i/s - 1.01x  slower

                                                    marshal_dump_load_time
                                                         ours:         1.6 i/s
                                                        trunk:         1.5 i/s - 1.03x  slower

                                                              securerandom
                                                         ours:         6.4 i/s
                                                        trunk:         6.3 i/s - 1.02x  slower

                                                              so_ackermann
                                                        trunk:         3.4 i/s
                                                         ours:         3.2 i/s - 1.06x  slower

                                                                  so_array
                                                         ours:         1.8 i/s
                                                        trunk:         1.7 i/s - 1.05x  slower

                                                           so_binary_trees
                                                         ours:         0.3 i/s
                                                        trunk:         0.2 i/s - 1.01x  slower

                                                            so_concatenate
                                                        trunk:         0.4 i/s
                                                         ours:         0.4 i/s - 1.01x  slower

                                                              so_exception
                                                         ours:         6.0 i/s
                                                        trunk:         5.7 i/s - 1.04x  slower

                                                               so_fannkuch
                                                        trunk:         1.9 i/s
                                                         ours:         1.9 i/s - 1.01x  slower

                                                                  so_fasta
                                                         ours:         0.8 i/s
                                                        trunk:         0.7 i/s - 1.03x  slower

                                                                  so_lists
                                                         ours:         3.0 i/s
                                                        trunk:         2.9 i/s - 1.03x  slower

                                                             so_mandelbrot
                                                        trunk:         0.5 i/s
                                                         ours:         0.5 i/s - 1.00x  slower

                                                                 so_matrix
                                                         ours:         2.6 i/s
                                                        trunk:         2.3 i/s - 1.14x  slower

                                                         so_meteor_contest
                                                         ours:         0.5 i/s
                                                        trunk:         0.5 i/s - 1.02x  slower

                                                                  so_nbody
                                                         ours:         0.9 i/s
                                                        trunk:         0.9 i/s - 1.03x  slower

                                                            so_nested_loop
                                                         ours:         1.5 i/s
                                                        trunk:         1.4 i/s - 1.07x  slower

                                                                 so_nsieve
                                                         ours:         0.9 i/s
                                                        trunk:         0.9 i/s - 1.04x  slower

                                                            so_nsieve_bits
                                                        trunk:         0.7 i/s
                                                         ours:         0.6 i/s - 1.01x  slower

                                                                 so_object
                                                         ours:         2.2 i/s
                                                        trunk:         2.0 i/s - 1.08x  slower

                                                           so_partial_sums
                                                        trunk:         0.7 i/s
                                                         ours:         0.7 i/s - 1.02x  slower

                                                               so_pidigits
                                                        trunk:         1.5 i/s
                                                         ours:         1.5 i/s - 1.00x  slower

                                                                 so_random
                                                        trunk:         2.5 i/s
                                                         ours:         2.4 i/s - 1.03x  slower

                                                                  so_sieve
                                                        trunk:         3.0 i/s
                                                         ours:         3.0 i/s - 1.01x  slower

                                                           so_spectralnorm
                                                         ours:         0.8 i/s
                                                        trunk:         0.8 i/s - 1.01x  slower

                                                              string_index
                                                         ours:         3.6 i/s
                                                        trunk:         3.6 i/s - 1.01x  slower

                                                            string_scan_re
                                                        trunk:         6.8 i/s
                                                         ours:         6.8 i/s - 1.00x  slower

                                                           string_scan_str
                                                         ours:        10.1 i/s
                                                        trunk:         9.6 i/s - 1.05x  slower

                                                               time_subsec
                                                         ours:         1.2 i/s
                                                        trunk:         1.2 i/s - 1.00x  slower

                                                             vm3_backtrace
                                                        trunk:        10.9 i/s
                                                         ours:        10.7 i/s - 1.01x  slower

                                                      vm3_clearmethodcache
                                                         ours:         5.9 i/s
                                                        trunk:         5.8 i/s - 1.02x  slower

                                                                    vm3_gc
                                                         ours:         1.0 i/s
                                                        trunk:         0.9 i/s - 1.02x  slower

                                                           vm3_gc_old_full
                                                        trunk:         0.4 i/s
                                                         ours:         0.4 i/s - 1.01x  slower

                                                      vm3_gc_old_immediate
                                                         ours:         0.5 i/s
                                                        trunk:         0.5 i/s - 1.00x  slower

                                                           vm3_gc_old_lazy
                                                         ours:         0.4 i/s
                                                        trunk:         0.4 i/s - 1.01x  slower

                                                      vm_symbol_block_pass
                                                         ours:         1.5 i/s
                                                        trunk:         1.5 i/s - 1.01x  slower

                                                    vm_thread_alive_check1
                                                         ours:         4.3 i/s
                                                        trunk:         4.3 i/s - 1.02x  slower

                                                           vm_thread_close
                                                        trunk:         1.6 i/s
                                                         ours:         1.6 i/s - 1.00x  slower

                                                        vm_thread_condvar1
                                                        trunk:         0.2 i/s
                                                         ours:         0.2 i/s - 1.00x  slower

                                                        vm_thread_condvar2
                                                         ours:         0.2 i/s
                                                        trunk:         0.2 i/s - 1.01x  slower

                                                     vm_thread_create_join
                                                         ours:         0.1 i/s
                                                        trunk:         0.1 i/s - 1.01x  slower

                                                          vm_thread_mutex1
                                                         ours:         3.0 i/s
                                                        trunk:         2.9 i/s - 1.04x  slower

                                                          vm_thread_mutex2
                                                         ours:         3.0 i/s
                                                        trunk:         2.9 i/s - 1.04x  slower

                                                          vm_thread_mutex3
                                                         ours:         1.5 i/s
                                                        trunk:         1.5 i/s - 1.03x  slower

                                                            vm_thread_pass
                                                         ours:         1.7 i/s
                                                        trunk:         1.7 i/s - 1.01x  slower

                                                      vm_thread_pass_flood
                                                        trunk:        12.2 i/s
                                                         ours:        12.0 i/s - 1.01x  slower

                                                            vm_thread_pipe
                                                         ours:         5.8 i/s
                                                        trunk:         4.8 i/s - 1.20x  slower

                                                           vm_thread_queue
                                                         ours:        14.1 i/s
                                                        trunk:        13.2 i/s - 1.07x  slower

                                                     vm_thread_sized_queue
                                                        trunk:         1.7 i/s
                                                         ours:         1.7 i/s - 1.01x  slower

                                                    vm_thread_sized_queue2
                                                        trunk:         0.2 i/s
                                                         ours:         0.2 i/s - 1.06x  slower

                                                    vm_thread_sized_queue3
                                                        trunk:         0.2 i/s
                                                         ours:         0.2 i/s - 1.04x  slower

                                                    vm_thread_sized_queue4
                                                        trunk:         0.5 i/s
                                                         ours:         0.5 i/s - 1.00x  slower

                                                                   app_erb
                                                         ours:     18621.4 i/s
                                                        trunk:     18571.2 i/s - 1.00x  slower

                                                         complex_float_add
                                                         ours:  19850100.0 i/s
                                                        trunk:  18304356.2 i/s - 1.08x  slower

                                                         complex_float_div
                                                         ours:    815929.6 i/s
                                                        trunk:    804934.9 i/s - 1.01x  slower

                                                         complex_float_mul
                                                         ours:   7914792.1 i/s
                                                        trunk:   7584506.3 i/s - 1.04x  slower

                                                         complex_float_new
                                                         ours:   2436543.3 i/s
                                                        trunk:   2267444.3 i/s - 1.07x  slower

                                                       complex_float_power
                                                         ours:   2877650.3 i/s
                                                        trunk:   2702150.0 i/s - 1.06x  slower

                                                         complex_float_sub
                                                        trunk:  12499726.7 i/s
                                                         ours:  11922735.3 i/s - 1.05x  slower

                                                                erb_render
                                                        trunk:   1686982.8 i/s
                                                         ours:   1639655.6 i/s - 1.03x  slower

                                                  (1..1_000_000).last(100)
                                                         ours:   1128359.0 i/s
                                                        trunk:   1039238.1 i/s - 1.09x  slower

                                                 (1..1_000_000).last(1000)
                                                         ours:    114397.0 i/s
                                                        trunk:    105792.8 i/s - 1.08x  slower

                                                (1..1_000_000).last(10000)
                                                         ours:     11327.9 i/s
                                                        trunk:     10613.9 i/s - 1.07x  slower

                                                                   require
                                                        trunk:         1.5 i/s
                                                         ours:         1.5 i/s - 1.01x  slower

                                                            require_thread
                                                         ours:         0.0 i/s
                                                        trunk:         0.0 i/s - 1.02x  slower

                                                            so_count_words
                                                        trunk:        10.3 i/s
                                                         ours:         8.6 i/s - 1.20x  slower

                                                           so_k_nucleotide
                                                        trunk:         1.2 i/s
                                                         ours:         1.2 i/s - 1.00x  slower

                                                     so_reverse_complement
                                                        trunk:         1.0 i/s
                                                         ours:         1.0 i/s - 1.03x  slower

             Time.strptime("28/Aug/2005:06:54:20 +0000", "%d/%b/%Y:%T %z")
                                                        trunk:    158144.9 i/s
                                                         ours:    153978.2 i/s - 1.03x  slower

                                                  Time.strptime("1", "%s")
                                                        trunk:   1623217.0 i/s
                                                         ours:   1579510.9 i/s - 1.03x  slower

                                         Time.strptime("0 +0100", "%s %z")
                                                        trunk:    229866.0 i/s
                                                         ours:    226532.4 i/s - 1.01x  slower

                                           Time.strptime("0 UTC", "%s %z")
                                                        trunk:    506339.7 i/s
                                                         ours:    499098.9 i/s - 1.01x  slower

                                             Time.strptime("1.5", "%s.%N")
                                                        trunk:   1153359.5 i/s
                                                         ours:   1147585.1 i/s - 1.01x  slower

                                  Time.strptime("1.000000000001", "%s.%N")
                                                         ours:    677285.4 i/s
                                                        trunk:    653277.1 i/s - 1.04x  slower

                              Time.strptime("20010203 -0200", "%Y%m%d %z")
                                                        trunk:    152038.2 i/s
                                                         ours:    150570.5 i/s - 1.01x  slower

                                Time.strptime("20010203 UTC", "%Y%m%d %z")
                                                        trunk:    223095.4 i/s
                                                         ours:    215194.2 i/s - 1.04x  slower

                                        Time.strptime("2018-365", "%Y-%j")
                                                        trunk:    138117.4 i/s
                                                         ours:    131866.8 i/s - 1.05x  slower

                                        Time.strptime("2018-091", "%Y-%j")
                                                        trunk:    132252.0 i/s
                                                         ours:    130185.8 i/s - 1.02x  slower

                                                             vm1_attr_ivar
                                                         ours:  66547093.5 i/s
                                                        trunk:  66511758.3 i/s - 1.00x  slower

                                                         vm1_attr_ivar_set
                                                         ours:  69413312.4 i/s
                                                        trunk:  49350947.3 i/s - 1.41x  slower

                                                                 vm1_block
                                                        trunk:  32752098.6 i/s
                                                         ours:  31093175.3 i/s - 1.05x  slower

                                                            vm1_blockparam
                                                         ours:  37477110.3 i/s
                                                        trunk:  36197265.4 i/s - 1.04x  slower

                                                       vm1_blockparam_call
                                                        trunk:  21444591.4 i/s
                                                         ours:  20790648.5 i/s - 1.03x  slower

                                                       vm1_blockparam_pass
                                                        trunk:  16725920.2 i/s
                                                         ours:  16001570.9 i/s - 1.05x  slower

                                                      vm1_blockparam_yield
                                                        trunk:  23038732.7 i/s
                                                         ours:  21738951.3 i/s - 1.06x  slower

                                                                 vm1_const
                                                        trunk: 206190049.3 i/s
                                                         ours: 190329349.3 i/s - 1.08x  slower

                                                                vm1_ensure
                                                        trunk:         2.9 i/s
                                                         ours:         2.7 i/s - 1.05x  slower

                                                          vm1_float_simple
                                                        trunk:  16170483.1 i/s
                                                         ours:  13888976.9 i/s - 1.16x  slower

                                                        vm1_gc_short_lived
                                                        trunk:   7559217.7 i/s
                                                         ours:   7551714.4 i/s - 1.00x  slower

                                            vm1_gc_short_with_complex_long
                                                         ours:   9755094.2 i/s
                                                        trunk:   9749988.0 i/s - 1.00x  slower

                                                    vm1_gc_short_with_long
                                                         ours:   7652729.7 i/s
                                                        trunk:   6897564.9 i/s - 1.11x  slower

                                                  vm1_gc_short_with_symbol
                                                         ours:   8513088.5 i/s
                                                        trunk:   8469648.7 i/s - 1.01x  slower

                                                             vm1_gc_wb_ary
                                                         ours:  87271750.1 i/s
                                                        trunk:  75275450.4 i/s - 1.16x  slower

                                                    vm1_gc_wb_ary_promoted
                                                         ours:  92752671.4 i/s
                                                        trunk:  74603365.1 i/s - 1.24x  slower

                                                             vm1_gc_wb_obj
                                                         ours: 124324358.2 i/s
                                                        trunk:  93787340.6 i/s - 1.33x  slower

                                                    vm1_gc_wb_obj_promoted
                                                         ours: 124374045.6 i/s
                                                        trunk:  94269369.4 i/s - 1.32x  slower

                                                                  vm1_ivar
                                                         ours: 196132010.0 i/s
                                                        trunk: 168668511.7 i/s - 1.16x  slower

                                                              vm1_ivar_set
                                                         ours: 146165033.0 i/s
                                                        trunk: 113435129.7 i/s - 1.29x  slower

                                                                vm1_length
                                                         ours: 141664937.3 i/s
                                                        trunk: 119473053.1 i/s - 1.19x  slower

                                                             vm1_lvar_init
                                                         ours:         0.9 i/s
                                                        trunk:         0.9 i/s - 1.03x  slower

                                                              vm1_lvar_set
                                                        trunk:  18571632.0 i/s
                                                         ours:  18433375.6 i/s - 1.01x  slower

                                                                   vm1_neq
                                                         ours: 101413858.9 i/s
                                                        trunk:  92775759.4 i/s - 1.09x  slower

                                                                   vm1_not
                                                        trunk: 253467676.5 i/s
                                                         ours: 247023415.2 i/s - 1.03x  slower

                                                                vm1_rescue
                                                         ours: 419802709.5 i/s
                                                        trunk: 412776080.1 i/s - 1.02x  slower

                                                          vm1_simplereturn
                                                         ours:  93571297.8 i/s
                                                        trunk:  85207339.5 i/s - 1.10x  slower

                                                                  vm1_swap
                                                        trunk: 189658514.2 i/s
                                                         ours: 173337887.3 i/s - 1.09x  slower

                                                                 vm1_yield
                                                        trunk:         1.2 i/s
                                                         ours:         1.1 i/s - 1.10x  slower

                                                                 vm2_array
                                                         ours:  45666761.0 i/s
                                                        trunk:  44336306.2 i/s - 1.03x  slower

                                                              vm2_bigarray
                                                         ours:  45681847.3 i/s
                                                        trunk:  44564545.6 i/s - 1.03x  slower

                                                               vm2_bighash
                                                        trunk:    615330.6 i/s
                                                         ours:    601907.8 i/s - 1.02x  slower

                                                                  vm2_case
                                                         ours:  98706578.6 i/s
                                                        trunk:  95500778.5 i/s - 1.03x  slower

                                                              vm2_case_lit
                                                        trunk:         2.4 i/s
                                                         ours:         2.2 i/s - 1.07x  slower

                                                        vm2_defined_method
                                                         ours:   3000120.5 i/s
                                                        trunk:   2776533.8 i/s - 1.08x  slower

                                                                  vm2_dstr
                                                         ours:   7134444.5 i/s
                                                        trunk:   7100820.0 i/s - 1.00x  slower

                                                                  vm2_eval
                                                        trunk:    414478.1 i/s
                                                         ours:    386919.6 i/s - 1.07x  slower

                                                          vm2_fiber_switch
                                                         ours:  11145487.4 i/s
                                                        trunk:  10763368.5 i/s - 1.04x  slower

                                                          vm2_freezestring
                                                         ours:  11677078.2 i/s
                                                        trunk:  11091342.2 i/s - 1.05x  slower

                                                                vm2_method
                                                         ours:  15199255.8 i/s
                                                        trunk:  10008117.0 i/s - 1.52x  slower

                                                        vm2_method_missing
                                                         ours:   3634592.6 i/s
                                                        trunk:   3563257.7 i/s - 1.02x  slower

                                                     vm2_method_with_block
                                                         ours:  11854056.9 i/s
                                                        trunk:   7760194.2 i/s - 1.53x  slower

                                                  vm2_module_ann_const_set
                                                         ours:   1810597.8 i/s
                                                        trunk:   1806716.1 i/s - 1.00x  slower

                                                      vm2_module_const_set
                                                         ours:   1800664.6 i/s
                                                        trunk:   1790003.5 i/s - 1.01x  slower

                                                                 vm2_mutex
                                                         ours:  16091923.2 i/s
                                                        trunk:  15937595.1 i/s - 1.01x  slower

                                                             vm2_newlambda
                                                         ours:  13684899.0 i/s
                                                        trunk:  12942282.6 i/s - 1.06x  slower

                                                           vm2_poly_method
                                                        trunk:         0.6 i/s
                                                         ours:         0.5 i/s - 1.17x  slower

                                                        vm2_poly_method_ov
                                                        trunk:         5.0 i/s
                                                         ours:         4.9 i/s - 1.02x  slower

                                                        vm2_poly_singleton
                                                        trunk:         1.2 i/s
                                                         ours:         1.1 i/s - 1.03x  slower

                                                                  vm2_proc
                                                         ours:  49184254.5 i/s
                                                        trunk:  43868935.7 i/s - 1.12x  slower

                                                                vm2_raise1
                                                         ours:   2130775.0 i/s
                                                        trunk:   2113433.9 i/s - 1.01x  slower

                                                                vm2_raise2
                                                         ours:   1269679.1 i/s
                                                        trunk:   1257485.6 i/s - 1.01x  slower

                                                                vm2_regexp
                                                        trunk:   7888004.0 i/s
                                                         ours:   7793196.5 i/s - 1.01x  slower

                                                                  vm2_send
                                                         ours:  34131883.3 i/s
                                                        trunk:  27575494.3 i/s - 1.24x  slower

                                                        vm2_string_literal
                                                         ours:  50476775.4 i/s
                                                        trunk:  49071788.7 i/s - 1.03x  slower

                                                    vm2_struct_big_aref_hi
                                                         ours:  58129477.1 i/s
                                                        trunk:  54828543.3 i/s - 1.06x  slower

                                                    vm2_struct_big_aref_lo
                                                         ours:  57067278.9 i/s
                                                        trunk:  53268249.9 i/s - 1.07x  slower

                                                       vm2_struct_big_aset
                                                         ours:         5.1 i/s
                                                        trunk:         4.9 i/s - 1.04x  slower

                                                    vm2_struct_big_href_hi
                                                         ours:  37220649.5 i/s
                                                        trunk:  30351451.0 i/s - 1.23x  slower

                                                    vm2_struct_big_href_lo
                                                         ours:  36004174.2 i/s
                                                        trunk:  32351983.7 i/s - 1.11x  slower

                                                       vm2_struct_big_hset
                                                         ours:         3.9 i/s
                                                        trunk:         3.5 i/s - 1.11x  slower

                                                     vm2_struct_small_aref
                                                         ours: 100203443.0 i/s
                                                        trunk:  71913237.5 i/s - 1.39x  slower

                                                     vm2_struct_small_aset
                                                         ours:         5.1 i/s
                                                        trunk:         5.1 i/s - 1.00x  slower

                                                     vm2_struct_small_href
                                                         ours:  41459782.4 i/s
                                                        trunk:  35919614.6 i/s - 1.15x  slower

                                                     vm2_struct_small_hset
                                                         ours:  41034910.6 i/s
                                                        trunk:  32746917.4 i/s - 1.25x  slower

                                                                 vm2_super
                                                        trunk:  23171700.7 i/s
                                                         ours:  20093768.5 i/s - 1.15x  slower

                                                                 vm2_unif1
                                                         ours:  77696470.6 i/s
                                                        trunk:  73687095.8 i/s - 1.05x  slower

                                                                vm2_zsuper
                                                         ours:  20981592.4 i/s
                                                        trunk:  20035914.8 i/s - 1.05x  slower

Results of time make rdoc

RDoc generation has historically served as a real-world, practical use case for a mid-size Ruby project. With the proposed changeset its execution time got slightly worse, by roughly half a second (23.13s versus 23.58s in the run below).
Calculating -------------------------------------
               trunk        ours
make rdoc      0.043       0.042 i/s -       1.000 times in 23.13s 23.58s

Comparison:
             make rdoc
                   trunk:         0.0 i/s
                    ours:         0.0 i/s - 1.02x  slower

Results of mame/optcarrot

Optcarrot comes with a benchmark script to measure interpreter performance. Its result shows that our proposal outperforms trunk, though the margin is very small.
Calculating -------------------------------------
                              trunk        ours
Optcarrot Lan_Master.nes     42.554      43.276 fps

Comparison:
             Optcarrot Lan_Master.nes
                    ours:        43.3 fps
                   trunk:        42.6 fps - 1.02x  slower

Results of discourse/discourse

Discourse is an open-source Rails application that comes with a benchmark script. Because it is a real application used in production, it has the largest code base among the benchmarks here. The result shows that our proposal yields better response times, while an increase in load time is observed.

Results for trunk

---
categories:
  50: 48
  75: 59
  90: 65
  99: 112
home:
  50: 51
  75: 62
  90: 69
  99: 119
topic:
  50: 12
  75: 13
  90: 15
  99: 44
categories_admin:
  50: 83
  75: 96
  90: 102
  99: 169
home_admin:
  50: 85
  75: 98
  90: 109
  99: 174
topic_admin:
  50: 42
  75: 47
  90: 55
  99: 115
timings:
  load_rails: 3707
ruby-version: 2.7.0-p-1
rss_kb: 245800
pss_kb: 237995
architecture: amd64
operatingsystem: Ubuntu
processor0: Intel(R) Core(TM) i7-8550U CPU @ 1.80GHz
virtual: physical
kernelversion: 4.15.0
physicalprocessorcount: 1
memorysize: 9.86 GB

Results for ours

---
categories:
  50: 47
  75: 50
  90: 64
  99: 98
home:
  50: 50
  75: 56
  90: 69
  99: 115
topic:
  50: 12
  75: 12
  90: 13
  99: 42
categories_admin:
  50: 80
  75: 87
  90: 100
  99: 157
home_admin:
  50: 82
  75: 93
  90: 102
  99: 171
topic_admin:
  50: 42
  75: 45
  90: 56
  99: 101
timings:
  load_rails: 3963
ruby-version: 2.7.0-p-1
rss_kb: 268452
pss_kb: 260307
architecture: amd64
operatingsystem: Ubuntu
processor0: Intel(R) Core(TM) i7-8550U CPU @ 1.80GHz
virtual: physical
kernelversion: 4.15.0
physicalprocessorcount: 1
memorysize: 9.86 GB

Conclusions

Two techniques are implemented to optimise the send-pop sequence: additional ABI flags to inform the callee about the optimisation, and actual deletion of the caller sequence. These techniques sacrifice the process bootup time to yield better performance, both on microbenchmarks and on a Rails application.
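
As a rough illustration of the second technique, the toy peephole pass below fuses a send that is immediately followed by a pop into a single flagged send. It is only a sketch under stated assumptions: the array-based instruction format, the `popped` slot and the name `fuse_send_pop` are hypothetical and do not mirror the real compiler or its flag plumbing, and jump targets are ignored entirely.

```ruby
# Toy instruction format (hypothetical): [:send, method_name, popped]
# where `popped` says whether the caller discards the return value.
def fuse_send_pop(insns)
  out = []
  insns.each do |insn|
    prev = out.last
    if insn == [:pop] && prev && prev[0] == :send && !prev[2]
      # Drop the pop and mark the preceding send as "return value unused".
      out[-1] = [:send, prev[1], true]
    else
      out << insn
    end
  end
  out
end

before = [
  [:putself], [:send, :foo, false], [:pop],
  [:putself], [:send, :bar, false], [:leave],
]
after = fuse_send_pop(before)
p after
```

In this toy model only the first call is rewritten, because the value of the second call is still needed by `leave`.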

Future work

A well-known waste of memory is when a block ends with an assignment. "Just in case" the value of that block is used, an array is created to store the assigned values, like this:
% ruby --dump=i -ve '1.times {|i| x, y = self, i }'
ruby 2.7.0dev (2019-03-05 trunk 67168) [x86_64-linux]
== disasm: #<ISeq:<main>@-e:1 (1,0)-(1,29)> (catch: FALSE)
== catch table
| catch type: break  st: 0000 ed: 0005 sp: 0000 cont: 0005
| == disasm: #<ISeq:block in <main>@-e:1 (1,8)-(1,29)> (catch: FALSE)
| == catch table
| | catch type: redo   st: 0001 ed: 0014 sp: 0000 cont: 0001
| | catch type: next   st: 0001 ed: 0014 sp: 0000 cont: 0014
| |------------------------------------------------------------------------
| local table (size: 3, argc: 1 [opts: 0, rest: -1, post: 0, block: -1, kw: -1@-1, kwrest: -1])
| [ 3] i@0<Arg>   | [ 2] x@1        | [ 1] y@2
| 0000 nop                                                              (   1)[Bc]
| 0001 putself                      [Li]
| 0002 getlocal_WC_0                i@0
| 0004 newarray                     2
| 0006 dup
| 0007 expandarray                  2, 0
| 0010 setlocal_WC_0                x@1
| 0012 setlocal_WC_0                y@2
| 0014 nop
| 0015 leave                                                            (   1)[Br]
|------------------------------------------------------------------------
0000 putobject_INT2FIX_1_                                             (   1)[Li]
0001 send                         <callinfo!mid:times, argc:0>, <callcache>, block in <main>
0005 nop
0006 leave                                                            (   1)

Note the unnecessarily complex output of block in <main>'s disasm. The proposed changeset is not capable of eliminating this newarray. That instruction is in the middle of the instruction sequence. Because the proposed optimisation can only eliminate the rearmost pure instructions of the sequence, the optimisation cannot reach there.
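
The disassembly above can also be reproduced from inside Ruby. The sketch below uses `RubyVM::InstructionSequence.compile` and `#disasm`, which exist in CRuby, to print the bytecode of the same block next to a variant that ends with an explicit `nil`. The variant is added here purely for comparison and is not part of the proposal. Instruction names and layout differ between Ruby versions, so inspect the output rather than assume a particular sequence.

```ruby
# Block whose last expression is a multiple assignment: its value is
# the block's return value, so the compiler keeps it around.
used = RubyVM::InstructionSequence.compile(
  '1.times {|i| x, y = self, i }'
).disasm

# Same block, but the assignment is no longer the last expression.
discarded = RubyVM::InstructionSequence.compile(
  '1.times {|i| x, y = self, i; nil }'
).disasm

puts used
puts discarded
```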

In order to make it optimisable, we can think of a new branch instruction that checks the VM_FRAME_FLAG_POPPED flag. With such branching possible, the compilation procedure can generate both sequences for popped and used return values, and let the branch decide which to use.

This optimisation is beyond the scope of this proposal; it would require restructuring too much of the compiler infrastructure.

Related works

  • This proposal includes #1943.
    • #1943 includes #1804, which is also included here.
  • The idea of purity first appeared in #1419.

@shyouhei shyouhei changed the title from "Sendpop" to "send-pop optimisation" Mar 19, 2019

@mame

Member

commented Mar 19, 2019

I'm not against the proposal, but I'd like to point out a very subtle incompatibility.

def foo
  x = :before
  $bndg = binding
  x = :after
  x
end

foo
p $bndg.local_variable_get(:x) #=> :after in trunk, :before in sendpop

The example above is very artificial, but there might be code like:

def start_thread
  initialized = false
  Thread.new do
    nil until initialized # spin lock
    p :ok
  end
  # some initialization code...
  initialized = true
end

start_thread
sleep

I guess that few programmers will fall into this pit. But if an unfortunate programmer did, the behavior would look extremely mysterious.

shyouhei added some commits May 24, 2016

new VM timestamp variable
This variable is expected to be an integer type which can be incremented
atomically.  Expected to be used where certain object's "freshness" is
vital, e.g. when invalidating a cache.
allow accessing unified operands from attributes
Attributes of normal instructions can look at their operands.
This changeset enables the same thing for operand-unified
instructions.
new attribute trace_equivalent
Some instructions are special cases of other ones.  These
instructions need not preserve their trace counterparts.  By reducing
those unnecessary trace instructions we can strip binary size of
vm_exec_core from 25,759 bytes to 24,924 bytes on my machine.

Yes, this changeset slows traces down a bit.  But is that a problem?
define purity of each instructions
This changeset introduces new instruction attribute called "purity".
By doing so we can eliminate calls to methods that consist entirely
of pure instructions.

The definition of purity is chosen to achieve that optimisation; that
is, only instructions that do nothing except stack manipulations are
marked so.  For instance instruction `once` is not pure, because it
can block other threads (so there can potentially be global side
effects).

A method call can both be pure and not pure at the same time.  What is
called at a specific call site cannot be determined until the very
moment when we actually call it.  Of course a method cannot be pure
until each and every method it calls is (possibly recursively)
pure.  So in short, purity of a method is updated on-the-fly.
skip pure methods
This changeset modifies several instructions so that, if the methods
(or blocks) about to be invoked are pure, they just do nothing.
Fix [GH-1943]
send-pop optimisation part one: calling convention
Sending a method, then immediately throwing away its return value,
is one of the most frequent wastes of time that ruby does.  Why not
tell methods if the caller uses that return value or not, and let
them use that info for optimal operations.

In order to do so our method calling convention is extended to have
VM_FRAME_FLAG_POPPED bit which indicates that the caller does not
use the return value.  There also is a new instruction called
opt_bailout, which omits creation of return values to bail out early.
add rb_whether_the_return_value_is_used_p()
Looking at how `make rdoc` is working, I noticed that strings
allocated inside of StringScanner#scan (which is called from
lib/rdoc/markup/parser.rb:508, "else @s.scan ...") are becoming
garbage immediately.  Why not make it possible for extension
libraries to know whether the return values are needed or not.  That
way StringScanner can avoid generation of such useless strings, to
reduce the GC pressure.
omit branch inside of vm_sendish
Now that opt_bailout is introduced, a method that is entirely pure can
have that instruction at the very beginning of its sequence.  Why not
just invoke such methods as usual and let the instruction do the job.
This adds some overhead (frame manipulations previously entirely
skipped can now occur) and removes another (purity calculations for
non-skippable method calls now eliminated).  So let's see the
trade-offs.

@shyouhei shyouhei force-pushed the shyouhei:sendpop branch from d2dc345 to 632459d Mar 20, 2019

shyouhei added some commits Dec 3, 2018

optimise String#slice!
Looking at how `make rdoc` is working, I noticed that strings
allocated inside of String#slice! (which is called from
lib/rdoc/markup/parser.rb:313 and several other places) are becoming
garbage immediately.  These usages of String#slice! are to delete
portions of the receiver and are not interested in the return values.
Why not avoid creation of the return value in such cases.

Note however that by doing so, String#slice is inevitably made
optimised also.  These two methods are tightly connected.  Decoupling
them needs lots of copy & paste, which I think is not a good idea.
optimise Enumerable#grep
Enumerable#grep is interesting in two ways.  First, although
almost everybody thinks it has nothing to do with return value
optimisations, it does.  The usage without return value can be
seen at ext/extmk.rb:368, "grep(/\A#{var}=(.*)/) {return $1}".
Second, even when there is no block passed and no return value
used at the same time, it cannot be a no-op.  We have to reroute
[Bug #5801].
add RubyVM.return_value_is_used?
RDoc::Parser::RubyTools#skip_tkspace_without_nl is one of the methods that
is frequently called with return value discarded.  Eliminating the
allocated array can benefit both time and memory consumption.

The problem is, it is hard to auto-eliminate such wasted return values
even when we can tell the method we don't need them.  This is because
variables _could_ escape from the scope.  For instance, uget_tk()
might be an alias of eval().  That is not the case for this particular
method, but auto-detecting such evil activities is very hard -- if not
impossible.

So to ease the situation we implement the RubyVM.return_value_is_used?
method.  By manually checking that property we can define a hand-crafted
faster variation of skip_tkspace_without_nl that does not allocate the
return values.
send-pop optimisation part two: eliminate pop
Sending a method, then immediately throwing away its return value, is
one of the most frequent wastes of time that ruby does.  Now that
callee methods can skip pushing objects onto the stack, why not let
call sites also avoid popping them.

In order to do so our compiler now does not emit pop instructions but
adds the VM_FRAME_FLAG_POPIT flag to the call info of adjacent send-ish
instructions.  It is now the caller's duty to properly ignore the
return value.

Signed-off-by: Urabe, Shyouhei <shyouhei@ruby-lang.org>
optimise rb_obj_dummy
Looking at discourse script/bench.rb, I found that
BasicObject#initialize is called a considerable number of times.  It
seems worth optimising.  By making sure we are calling rb_obj_dummy,
we can safely skip the frame manipulations.
modify tests to properly trigger JIT
These methods were lightweight enough for the interpreter to avoid
JIT compilations.  Make them a little complicated so that they can
be properly considered for optimisations by the engine.
precalc INSN_CALLER_RETVAL_POPPED_P()
This macro is expanded inside of send-ish instructions, which are
super-duper hot paths.  By statically analysing this into the call
info, we can optimise situations where return values _do_ get used.
reduce instruction counts
Experiments show that some recent compilers give up inlining
functions called from inside of vm_exec_core, seemingly because
it was too big.  This changeset deletes sendpop instructions,
merging them into the bare ones.
no inline vm_method_cfunc_is()
It seems vm_method_cfunc_is() is inlined into vm_exec_core(), which
is not what we want here.  Make a wrapper function to absorb that.
optcarrot tweak
I admit this is a dirty hack just to boost optcarrot FPS.
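
The "add RubyVM.return_value_is_used?" commit above describes hand-writing a variant of a method that skips allocating its result when nobody looks at it. A minimal sketch of that pattern follows. It assumes the method takes no arguments and returns a boolean for the currently running method, which the commit message does not spell out. `RubyVM.return_value_is_used?` exists only on this branch, so the sketch falls back to always building the result elsewhere. `skip_spaces` and the token format are hypothetical stand-ins, not RDoc's actual code.

```ruby
# True only on a build that carries the proposed method (assumption:
# no arguments, returns true/false for the method that calls it).
HAS_RETVAL_PROBE =
  defined?(RubyVM) && RubyVM.respond_to?(:return_value_is_used?)

# Hypothetical stand-in for a token-skipping helper: removes leading
# :space tokens and returns them, unless the caller ignores the result.
def skip_spaces(tokens)
  # Called directly in this method body so that it describes this frame.
  collect = !HAS_RETVAL_PROBE || RubyVM.return_value_is_used?
  skipped = collect ? [] : nil
  while tokens.first == :space
    tk = tokens.shift
    skipped << tk if skipped
  end
  skipped
end

tokens = [:space, :space, :word]
skip_spaces(tokens) # return value discarded at this call site
p tokens
```

On a stock interpreter the fallback always allocates the array, so behaviour is unchanged; only a build with the proposed method could take the allocation-free path.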

@shyouhei shyouhei force-pushed the shyouhei:sendpop branch from 632459d to 59baad7 Mar 20, 2019

@shyouhei

Member Author

commented Mar 20, 2019

@mame Good catch, pushed 90c3d4b which fixes the problem.

cancel optimisation when captured
If there are chances for local variables to live longer than the
original scope, we cannot eliminate local variable assignments.
Optimisations are not possible then.

@shyouhei shyouhei force-pushed the shyouhei:sendpop branch from 90c3d4b to ce8b02f Mar 21, 2019

@noahgibbs

commented Apr 27, 2019

I tried running this branch w/ Rails Ruby Bench (SHA before: 139634, after: ce8b02f). After 90 batches of 30k req/batch, it's still well within the margin of measurement error:

before: median throughput: 185.5, variance 5.13, std dev 2.26
after: median throughput: 184.7, variance 2.61, std dev 1.62

So if this makes a difference for Rails Ruby Bench at all, it's a small one.

@shyouhei

Member Author

commented Apr 27, 2019

@noahgibbs Thank you for trying this, and for the very insightful data! I agree that this proposal does not make Rails apps 3x faster. The thing optimised here does not take a major amount of time. That said, your measurement kind of disappointed me. The variance and stddev improvements can be explained by reduced memory access, but I also expected median throughput to get better. I will take a closer look at my local discourse benchmarks. Thanks again!

@noahgibbs

commented Apr 27, 2019

I had been hoping for better as well. This may be a case where Discourse isn't benefiting as much as other code would.

@k0kubun k0kubun changed the base branch from trunk to master Aug 15, 2019
