[Feature #9362] Minimize cache misshit to gain optimal speed #495

Closed
wants to merge 52 commits into trunk

Conversation

5 participants
@shyouhei
Member

shyouhei commented Jan 3, 2014

Cachelined: A ruby improvement

Abstract

It's faster, even compared to 2.1.

Introduction

Ruby is an object-oriented language. Although that is normal these days, "everything is an object" has been a key characteristic of this language.

The "object", in practice, is stored as a C struct named struct RVALUE (with a few exceptions such as true, false). This struct is a 5 machine-word sized structure. Its first word is management area mainly for flags, the next word is a pointer to that object's class if any, and the remaining 3 words are dedicated for each classes.

struct RBasic {
    VALUE flags;        /* type tag, GC state, and per-type flag bits */
    const VALUE klass;  /* pointer to the object's class, if any */
};

typedef struct RVALUE {
    union {
        struct {
            struct RBasic basic;
            VALUE v1;   /* three words of per-class payload */
            VALUE v2;
            VALUE v3;
        } values;
        …
    } as;
} RVALUE;

The problem is that 5 is a prime number, so no cache of any geometry can store this struct efficiently. Most notably, CPUs have long been equipped with data caches, and Ruby's objects do not fit them well. That does not mean anything breaks, but a significant slowdown is happening.

Today I'd like to propose a fix: make objects power-of-two sized. What I did was make objects 8 words long instead of 5. By doing so, an object, and most importantly its struct RBasic part, is forced to fit within a single cache line.

A side effect is that the extended area of each object can be used to store additional information. For instance, strings can hold up to 48 bytes inside their objects; most short strings are now embedded, which reduces memory allocations.

Cache Lines

Caching is not a recent development; even the Intel i386 could use up to 64 KiB of L1 cache. But as CPUs get faster and faster, the importance of the cache rises rapidly.

When the CPU retrieves a piece of memory, a whole surrounding region is loaded for later use, no matter how many bytes were requested. This region is called a "cache line". The line size varies from model to model, but most recent CPUs use 64 bytes. So whenever you touch memory, 64 bytes are involved at once.

As I mentioned above, ruby objects are (were) 5 words wide, or 40 bytes, and objects are arranged tightly in memory. If an object starts at byte offset 0 of a cache line, the next object has its first 24 bytes in that line but its remaining 16 bytes in the next one. Because 5 and 8 are coprime, every possible placement occurs. Only 3 out of the 8 possible placements hold an entire object within a single cache line; all other cases need to access physical memory twice.
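
To make the placement argument concrete, here is a minimal standalone sketch (not part of the patch) that enumerates where tightly packed 40-byte objects land relative to 64-byte lines; readers can use it to check the arithmetic themselves:

/* Standalone sketch (not from the patch): enumerate where tightly
 * packed 40-byte objects land relative to 64-byte cache lines,
 * starting from a line-aligned base. */
#include <stdio.h>

int main(void)
{
    const int obj = 40, line = 64;
    /* lcm(40, 64) == 320 bytes, so the pattern repeats every 8 objects */
    for (int k = 0; k < 8; k++) {
        int off = (k * obj) % line;
        printf("object %d at line offset %2d: %s\n", k, off,
               off + obj <= line ? "fits in one line" : "straddles two lines");
    }
    return 0;
}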

Our Approach

Fixing this issue is simple: make ruby objects just large enough that they tightly fit cache lines. By carefully aligning the initial allocation, we can force every object to be cache-line aligned. By doing so, every access to an object is guaranteed to incur at most one physical memory access.
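
As an illustration of the alignment idea, here is a minimal sketch using posix_memalign; the patch itself changes ruby's own page allocation rather than using this exact code:

/* Sketch of the alignment idea (not the patch's actual allocator):
 * carve 64-byte slots out of a page whose base is line-aligned,
 * so every slot begins exactly on a cache line boundary. */
#define _POSIX_C_SOURCE 200112L
#include <stdio.h>
#include <stdlib.h>

#define CACHE_LINE 64

int main(void)
{
    void *page;
    /* posix_memalign guarantees the base is a multiple of CACHE_LINE */
    if (posix_memalign(&page, CACHE_LINE, 16 * 1024) != 0) return 1;
    /* every 64-byte slot carved out of the page is then line-aligned */
    for (int i = 0; i < 4; i++)
        printf("slot %d at %p\n", i, (void *)((char *)page + i * CACHE_LINE));
    free(page);
    return 0;
}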

Embedding Others

An interesting side effect of expanding the object width is that it eliminates some memory allocations.

Several kinds of ruby objects "embed" their contents when possible, including strings, arrays, hashes, and instances of pure-ruby classes; that is, a wide range of popular objects do so. Now that the width of an object is extended, there is much more room for embedding. For instance, arrays can now embed up to 6 elements, hashes can hold up to 3 key-value pairs, and strings can hold up to 48 bytes. These relatively small objects are now self-contained, so they avoid the extra cost of allocation.
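
Where those capacities come from can be sketched as follows, assuming sizeof(VALUE) == 8 and an 8-word (64-byte) slot; the constants are illustrative, not the patch's actual macros:

#include <stdio.h>

#define NWORDS 8   /* words per object slot: one 64-byte cache line */
#define WORD   8   /* sizeof(VALUE) on a 64-bit build */
#define HEADER 2   /* struct RBasic: flags + klass */

int main(void)
{
    int payload = NWORDS - HEADER;                          /* 6 words */
    printf("array:  %d embedded elements\n", payload);      /* 6 */
    printf("hash:   %d key-value pairs\n", payload / 2);    /* 3 */
    printf("string: %d embedded bytes\n", payload * WORD);  /* 48 */
    return 0;
}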

Experiments

To determine the effectiveness of this approach I ran several experiments on my VAIO Pro laptop. This machine has a Haswell CPU with 2 physical / 4 logical cores and runs Linux 3.12.0 (3.12.0 was needed for Linux to support Intel P-state on this chip).

Results of make benchmark

Here is the result of make benchmark against ruby 2.1.1p2, 2.0.0p376, 1.9.3p488, and ours (trunk r44485 + our patch), all compiled from source with the same compiler (clang 3.4) and the same options.

The results of the proposed approach are very similar to those of 2.1.1p2; they seem virtually identical. But our approach is the fastest in most cases. Most cases are a few percent faster than 2.1.1p2, which is typical of cache optimizations. Several cases gain a larger speedup because of the side effect mentioned above, most notably bm_vm3_gc, which in fact spent most of its time on hash allocations.

Results of make rdoc

To measure the impact of our approach on a real-world program, make rdoc XRUBY=${target} was run against the set of ruby versions mentioned above. However, 1.9.3 could not make it through due to the version RDoc supports, so it is not on the graph.

Here again, our approach is the fastest. RDoc is almost 100% pure ruby, so the hash optimizations benefit it very much.

Memory usage

One key concern about our approach is the increased size of objects, which can impact the memory footprint. At the same time, though, many small memory regions (notably hashes) are packed into their objects, which can partially reduce memory usage. How much is used in practice?

To measure memory usage I used valgrind's memory profiler, which can profile memory usage in a non-disruptive manner. Here are the memory footprints of rdoc generation:


The two graphs above are the memory profiles for ours and for 2.1.1p2. They show that the memory footprint does blow up. However, I also have to note that the blow-up ratio is 260.5 / 207.4 == 1.256, which is below what would be expected from the object size alone (8 words / 5 words == 1.6 times bigger). If you take a closer look, you can see that the amount acquired by heap_assign_page() increases while that of objspace_xmalloc() decreases. This means that some memory originally allocated externally is now embedded in objects, which clearly demonstrates the side effect we expected.

Cache profile

Valgrind can also profile cache misses. However, this feature is extremely slow and not suitable for real-world programs like RDoc, so I tested ruby --disable-gems -e"0x400000.times { Object.new }" instead. The output below is the cache-miss delta between ours and 2.1.1p2.

zsh % cg_diff cachegrind.out.cachelined cachegrind.out.211p2 | cg_annotate /dev/stdin
--------------------------------------------------------------------------------
Files compared:   cachegrind.out.cachelined; cachegrind.out.211p2
Command:          ./ruby --disable-gems -e0x400000.times { Object.new }; /home/shyouhei/data/target/ruby_2_1/bin/ruby --disable-gems -e0x400000.times { Object.new }
Data file:        /dev/stdin
Events recorded:  Ir I1mr ILmr Dr D1mr DLmr Dw D1mw DLmw
Events shown:     Ir I1mr ILmr Dr D1mr DLmr Dw D1mw DLmw
Event sort order: Ir I1mr ILmr Dr D1mr DLmr Dw D1mw DLmw
Thresholds:       0.1 100 100 100 100 100 100 100 100
Include dirs:     
User annotated:   
Auto-annotation:  off

--------------------------------------------------------------------------------
        Ir  I1mr ILmr      Dr       D1mr DLmr      Dw    D1mw    DLmw 
--------------------------------------------------------------------------------
12,772,600 3,381  -36 742,954 -1,725,002    2 222,942 -30,220 -14,810  PROGRAM TOTALS

--------------------------------------------------------------------------------
       Ir  I1mr ILmr      Dr       D1mr DLmr        Dw    D1mw    DLmw  file:function
--------------------------------------------------------------------------------
8,490,879    13   -1 -65,345 -1,553,377    0  -102,802 -19,959       0  gc.c:gc_heap_lazy_sweep
1,435,366   -12    1 247,409     -1,476    0   144,589     436       0  gc.c:gc_mark
  938,264     4    1  42,615     -3,114    0    32,188    -328       0  gc.c:gc_marks_body
  732,366   -23   -2 180,170     23,350    0    91,665      22       0  gc.c:gc_mark_children
  544,747    -1    2  72,222        -68    0     1,826       0       0  gc.c:mark_current_machine_context
  516,747   -11    0 167,970     21,331    0    54,456      59       0  st.c:st_foreach
  226,596  -368    0  78,816     10,844    0    39,408    -353       0  gc.c:mark_method_entry_i
 -204,979   -33    1 -45,053   -152,686    0   -38,672  12,178       2  gc.c:newobj_of
  119,003  -365    0  36,875      5,999    0     3,186       0       0  st.c:st_foreach_check
 -100,022   -93    1 -11,628       -556    0   -37,545 -15,310 -14,791  gc.c:heap_assign_page
  -95,217   -37    0 -34,943       -125    0         0       0       0  gc.c:rb_gc_mark_maybe
   76,464     0    0  33,453      2,771    0    23,895       0       0  gc.c:mark_const_entry_i
   34,515     5    0  10,620          7    0     5,310       0       0  hash.c:foreach_safe_i
   29,571    21    0   3,942        296    0     4,854       0       0  gc.c:rb_gc_mark_locations
   29,205     0    0  15,930      1,177    0     5,310       0       0  variable.c:mark_global_entry
   27,317   -42    0   4,897        708    0       944       0       0  vm.c:rb_vm_mark
   22,656    -1    0   7,552          0    0         0       0       0  gc.c:rb_gc_mark
  -20,253     0    0  -2,421     -7,125    0       -27       1       0  gc.c:rb_gc_call_finalizer_at_exit
  -19,602    21    0  -4,340     -1,669    0    -5,874  -6,707       0  gc.c:garbage_collect_body
  -18,335   134    0  -9,156         -7    0    -3,061      -3       0  parse.y:rb_intern2
  -18,131   -33   -2  -5,163       -229    0    -2,464     -44       3  /build/buildd/eglibc-2.17/malloc/malloc.c:_int_free
  -16,541    60    0  -2,080       -281    0    -2,300     -65      83  /build/buildd/eglibc-2.17/malloc/malloc.c:_int_malloc
   16,284     8    0   8,142          0    0     5,428      -1       0  gc.c:mark_entry
   14,809  -377    0   4,189        881    0     2,773     118       0  vm.c:rb_thread_mark
   14,342  -119   -1   3,611        474    0     1,892      59       0  gc.c:gc_mark_roots

It shows that data reads ("Dr") increase while data cache misses ("D1mr") decrease. This agrees well with the memory profile: the footprint increases, so the read count goes up, while cache misses decrease by design.

Conclusion

Proposed is a way to reduce cache misses in ruby's object system. It speeds up both benchmarks and real-world programs, at the cost of a slightly larger memory footprint.

UPDATE: charts updated, as I recompiled ruby using clang 3.4.

shyouhei added some commits Nov 23, 2013

flexible object size
This change introduces struct RValueStorage, so that objects can
be any compile-time-defined width.

Note, however, that some objects embed others by nature, and so
they need length info packed into their flags.  Most are OK, but
Strings have no space left in their flags to introduce wider
buffers (because they also embed the encoding index there).  In
order to cover that we needed to use the upper half of the flag
bits.  If there is no such thing as an upper half on your machine,
Strings cannot use the full amount of space they are promised.
add STATIC_ASSERT
Now that embed lengths are computed automatically, it is not as
obvious as before whether the calculated lengths are OK.  We are
wiser to machine-check them.
introducing new struct REmbedHash
Now that there is a bunch of room in each object, let's utilize a
Hash's unused area by filling its contents directly into it.  This
should speed up creating small hashes like { foo: 'bar' }.
array.c needs to care about REmbedHash.
Why not just call rb_hash_clear?  That would avoid touching hash
internals from array.c.
our GC has to know about REmbedHash.
Now that T_HASH comes in both embedded and non-embedded styles, the
GC cannot blindly assume there is an st_table for every hash.
RHASH_SIZE() is much more complicated now.
I could have moved this inline function into ruby.h, but the sad
news is that a PROPER implementation of RHASH_SIZE needs st.h, and
ruby.h is not meant to include it.  The original implementation
magically avoided this by deferring access to the struct using the
preprocessor.
new hash instances now embed by default
The technique used here is that GCC (and clang) implements a much
richer literal syntax than standard C, so a single assignment can
express much more than you might think.  Compare the two #ifdef
parts and see the reduced cyclomatic complexity of the GCC part,
which means more room for optimizations.
implement Hash#default=
Hashes that have a dedicated default cannot embed anything else,
except when that default is nil.  This implementation is
intuitive, I believe.
implement Hash#[]
A hash that embeds its contents has a fixed size, so the search
ends in amortized O(1) time, which does not change the complexity
class of a generic Hash.
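
As a rough illustration of that bounded search, here is a minimal sketch of an embedded lookup over a flat key/value word array; rembed_hash_aref and its layout are illustrative assumptions, not the patch's actual code:

#include <ruby.h>

/* len is the number of embedded pairs (at most 3 with 8-word objects);
 * kv is laid out as key0, val0, key1, val1, key2, val2. */
static VALUE
rembed_hash_aref(const VALUE *kv, long len, VALUE key)
{
    for (long i = 0; i < len; i++) {
        if (rb_eql(kv[2 * i], key))     /* #eql? comparison */
            return kv[2 * i + 1];
    }
    return Qnil;  /* caller falls back to the hash's default */
}
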
implement Hash#[]=
If there is room in the hash, use it.  If not, first convert it
into an ordinary hash and then insert into the st_table.
rb_hash_modify() to automatically explode
To reduce the patch size, I fall back to exploding whenever a
modification (other than rb_hash_aset) is performed.  This should
have no effect other than a performance regression.
implement Hash#==
This could have been implemented more efficiently.  But to minimize
the patch for the sake of the pull request, I took a simpler, much
less efficient way.  This is to be fixed after this branch lands.
implement Hash#dup
I hope this is OK... isn't it?
implement Hash.[]
This was a bit tricky.  The argument hash may or may not be
embedded, and both cases must be considered.
implement Hash#each
All operations that need iteration over a hash are hereby
considered too complex for embedded hashes.  This is less
efficient than ideal, but it greatly reduces the patch size and
works for me anyway.
implement Hash#delete
A hash that embeds its contents does not permit "holes" inside its
body, so a deletion needs to rearrange the remaining contents.
This is much like the situation in Array#delete_at.
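
The rearrangement amounts to closing the hole with a shift, as in this minimal sketch (the function name and flat key/value array are illustrative assumptions, not the patch's code):

#include <ruby.h>
#include <string.h>

/* Delete pair i out of len embedded pairs by shifting the following
 * pairs down one slot, so the embedded area stays hole-free. */
static void
rembed_hash_delete_at(VALUE *kv, long len, long i)
{
    memmove(&kv[2 * i], &kv[2 * (i + 1)],
            (size_t)(len - 1 - i) * 2 * sizeof(VALUE));
}
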
implement Hash#to_h
The part in question is in fact not for Hash#to_h itself, but for
its child classes, which share this method implementation.  In that
case, if self embeds its contents, the returned object should too.
Conversely, if self is an ordinary hash, the embedded hash that
rb_hash_new() generates by default does not suit, so we explode it.
implement Hash#has_key?
This is intuitive.
implement Hash#compare_by_identity?
It is false for embedded hashes.
implement Hash#values
Same as Hash#keys.
support marshalling REmbedHash
I decided not to introduce incompatibilities into the marshal
format.  Hashes are always exploded before marshalling and always
imploded while unmarshalling.  This shall have no impact, except a
slight performance slowdown.
Thread-local storage REmbedHash support
I don't know why, but Ruby's TLS uses Ruby-level Hashes while at
the same time directly touching hash internals.  For performance?
Or some esoteric reason to avoid rb_funcall?  Anyway, we have to
deal with REmbedHash here as well.
implement Hash#initialize_copy
Copying a hash should definitely be efficient.  I took the strategy
that any copy of an embedded hash should itself be embedded.
implement Array#|
I find it hard to implement this method without exploding.  It
directly uses st_update, and I think that is reasonable.
fuck rb_obj_clone()
That function blindly clobbers the flags of generated objects.  We
need to take care of that.
use L1 cache coherency line size
The size of RValueStorage can be anything; ultimately, setting it
to either 1 or 4096 should break nothing.  But to boost
performance, it is wiser to fit a single RValueStorage into a
single cache line.
eliminate cache misshit
Our object space is a homogeneous array of RVALUEs.  So by aligning
the head of those spaces, we can make every RVALUE cache-aligned.
RGenGC awareness
Hashes are no longer shady by default.  This should boost things up.
@shyouhei

Member

shyouhei commented Jan 3, 2014

As a committer I could have merged this as-is, but I wanted to request your comments, because this patch has an obvious drawback (binary breakage).

shyouhei added some commits Jan 3, 2014

dodge Travis' complaints about static functions
Though I believe it is legal to declare a function inside a function.
@vendethiel

vendethiel commented Jan 3, 2014

Great job 👍 .

Might this need a raise in ruby's malloc threshold?

@shyouhei

Member

shyouhei commented Jan 3, 2014

@Nami-Doc Could be. But I'm not sure whether raising it would boost things (because objects are now larger) or lowering it would (because there are fewer mallocs).

@shyouhei

Member

shyouhei commented Jan 3, 2014

Eric Wong added some commits Jan 4, 2014

Eric Wong
ruby.h: use 6 bits for embedded array/struct length
This is to be compatible with 32-bit systems, where
sizeof(VALUE) == 4 and the cache line size is 64 bytes.
Eric Wong
test_objspace: increase string size for sharing
Embedded strings got bigger, so we must use bigger strings if we
want to test sharing.
Eric Wong
hash: fix RHASH_IFNONE
Embedded hashes have .ifnone == Qnil.
Eric Wong
test_hash: bump up hash size to force rehashing
Embedded hashes are not really hash tables, so they cannot be
rehashed.
@infogulch


infogulch commented Jan 4, 2014

Interesting. Have you tried 6-word objects? That would fit 4 objects into 3 cache lines, and 2/3 of the cache lines would contain complete objects, while increasing the size by only 1/5 instead of 3/5. Are you sure your figure of 3/8 arrangements holding a complete object is right? It looks like 4/5 to me at first glance. (From HN)

@fleitz


fleitz commented Jan 4, 2014

Cache line size is generally 64 bytes on x86, 8*8=64 so each cache line contains exactly 8 objects. This assumes a contiguous array of objects initially cache line aligned.

@infogulch


infogulch commented Jan 4, 2014

@fleitz Re-read the pull request. Right now, ruby objects take 5 words, not 5 bytes. This proposal makes ruby objects take 8 words, or one whole cache line per object (assuming 8-byte words and 64-byte cache lines).

Eric Wong and others added some commits Jan 4, 2014

Eric Wong
string.c: clear old flags when becoming embedded
We no longer overload the shared/assoc flags for embedded
strings 32 bytes or longer, so we cannot rely on setting the
embedded length to clear the shared/assoc flags.

Thus, a string which goes from:
	(1)no-embed -> (2)embed -> (3)no-embed
may inherit false shared/assoc flags from the original noembed form,
leading to assertion failures and segfaults.
@shyouhei

Member

shyouhei commented Jan 4, 2014

OK, I found @infogulch's comment interesting. I ran RDoc while changing the object size one word at a time, and got this chart.

To obtain time and memory at once I used time make rdoc…, so the measurements differ from my previous experiments, which used valgrind.

Observations:

  • The 5-word case, which is also the object size 2.1 uses, is still a bit faster than 2.1.
    • This is because our patch inlines objects more aggressively than vanilla 2.1.
  • Memory usage increases monotonically.
  • 8 words is the fastest.
  • 7 words is (surprisingly) faster than 6 words, but only a little.
  • Anything bigger than 8 seems to be of no use.

I see 6 words as another good point on the memory-time tradeoff: it uses almost the same amount of memory as before and is much faster.

Eric Wong and others added some commits Jan 5, 2014

Eric Wong
hash.c: do not explode on Hash#hash
This fixes the exploding of recursive hashes: inserting a hash into
itself would trigger an explode and lead to a corrupted hash and
wasted memory.
@rurounijones


rurounijones commented Jan 6, 2014

Would it be worth making the word size a compile-time flag, defaulting to 6 (assuming this is possible)? Then, in cases where the extra speed is important enough to justify the memory increase (heavy-duty processing, for example), the option to use 8 words would still be available.

Eric Wong
hash: fix GC crash during Hash#rehash
The temporary hash has the embed flag set, so if GC attempts to
mark it, it can clobber ntbl.  For safety's sake, use a regular
hash and explode it before giving it ->ntbl.
@shyouhei

Member

shyouhei commented Jan 6, 2014

@rurounijones Making the object size configurable is in fact technically possible, but it would confuse extension libraries. Different object sizes are of course binary incompatible, so end users would encounter mysterious SEGVs from corrupted objects, which are very hard to track down. So I'd prefer everyone to use a single, better default.

It's completely OK with me to choose 6.

@shyouhei

Member

shyouhei commented Feb 17, 2014

CF: [ruby-core:60784]

At the latest ruby-core meetup we agreed to reject this particular patch because it consumes too much memory, so I am closing it now. I believe the concept itself is still valid, though. Let me brush things up.

Look forward to my second try.

@shyouhei shyouhei closed this Feb 17, 2014
