[Feature #9362] Minimize cache misshit to gain optimal speed #495

Closed
wants to merge 52 commits

5 participants

@shyouhei
Member
shyouhei commented Jan 3, 2014

Cachelined: A ruby improvement

Abstract

It's faster, even compared to 2.1.

Introduction

Ruby is an object-oriented language. Although it is normal these days, "everything is an object" has been a key characteristic of this language.

The "object", in practice, is stored as a C struct named struct RVALUE (with a few exceptions such as true, false). This struct is a 5 machine-word sized structure. Its first word is management area mainly for flags, the next word is a pointer to that object's class if any, and the remaining 3 words are dedicated for each classes.

struct RBasic {
    VALUE flags;        /* management word: type tag, GC flags, etc. */
    const VALUE klass;  /* the object's class, if any */
};

typedef struct RVALUE {
    union {
        struct {
            struct RBasic basic;
            VALUE v1;   /* 3 words of class-specific payload */
            VALUE v2;
            VALUE v3;
        } values;
        …
    } as;
} RVALUE;

The problem is that 5 is a prime number, so a cache of any line size cannot store this struct efficiently. Most notably, CPUs have been equipped with data caches since the middle of their history, and Ruby's objects do not fit them well. That never means breakage, but a significant slowdown does happen.

Today I'd like to propose a fix for this: make objects power-of-two sized. What I did was to make objects 8 words long instead of 5. By doing so an object, most importantly its struct RBasic part, is forced to fit within a single cache line.

A side effect is that the extended area of each object can be used to store additional info. For instance, strings can hold up to 48 bytes inside their objects; most short strings are now embedded, which reduces memory allocations.

Cache Lines

Caching is not a recent development; even the Intel i386 could use up to 64 KiB of cache. But as CPUs get faster and faster, the importance of the cache rises rapidly.

When a CPU tries to retrieve an area of memory, no matter how many bytes were requested, a surrounding chunk of memory is loaded anyway for later use. This chunk is called a "cache line". The line size varies from model to model, but most recent CPUs use 64 bytes. So whenever you touch memory, 64 bytes are involved at once.
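
In other words, with 64-byte lines the line a byte belongs to is just its address with the low 6 bits cleared; a minimal sketch:

#include <stdint.h>

/* touching any byte at `addr` loads the whole 64-byte line
 * that starts at this base address */
#define CACHE_LINE 64
#define LINE_BASE(addr) ((uintptr_t)(addr) & ~(uintptr_t)(CACHE_LINE - 1))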

As I mentioned above, ruby objects are (were) 5 words wide, or 40 bytes, and objects are tightly packed in memory. If an object starts at byte offset 0 of a cache line, the next object has its first 24 bytes in that line, while its remaining 16 bytes spill into the next one. Because 5 is prime, the object size and the cache line size are coprime, so every placement pattern occurs. Only 3 out of the 8 possible placements hold an entire object within one cache line; all other cases need to access physical memory twice.

Our Approach

Fixing this issue is simple: just make ruby objects large enough that they tightly fit into cache lines. By carefully aligning the initial allocation, we can force every object to be cache-line aligned. By doing so, every access to an object is guaranteed to incur at most one physical memory access.
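
The two ingredients are a 64-byte object size and a 64-byte-aligned heap page. A minimal sketch of the allocation side (not ruby's actual allocator; names and sizes are assumptions):

#include <stdlib.h>

#define CACHE_LINE 64   /* assumed line size       */
#define SLOT_SIZE  64   /* 8 words of 8 bytes each */

/* Carving slots out of a cache-line-aligned page guarantees that
 * every slot starts on a line boundary, so no object ever
 * straddles two lines. */
static void *
alloc_heap_page(size_t nslots)
{
    void *page = NULL;
    if (posix_memalign(&page, CACHE_LINE, nslots * SLOT_SIZE) != 0)
        return NULL;
    return page;  /* slot k lives at (char *)page + k * SLOT_SIZE */
}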

Embedding Others

An interesting side effect of expanding the object width is that it eliminates some memory allocations.

Several kinds of ruby objects "embed" their contents when possible. These include strings, arrays, hashes, and instances of pure-ruby classes; that is, a wide range of popular objects do so. Now that the width of an object is extended, there is much more room for embedding. For instance, arrays can now embed up to 6 elements, hashes can hold up to 3 key-value pairs, and strings can hold up to 48 bytes. These relatively small objects are now self-contained, so they avoid the extra cost of allocations.
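
To see where the 48 bytes come from: an 8-word object minus the 2-word struct RBasic header leaves 6 words, i.e. 48 bytes on a 64-bit machine. A sketch of the embedded layout (illustrative names, not ruby's actual definitions):

struct example_embedded_string {
    struct RBasic basic;  /* 2 words: flags + klass           */
    char ary[48];         /* the remaining 6 words hold chars */
};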

Experiments

To determine the effectiveness of this approach I ran several experiments on my VAIO Pro laptop. This machine has a 2-physical / 4-logical-core Haswell CPU and runs Linux 3.12.0 (3.12.0 was needed for Linux to support Intel P-state on this chip).

Results of make benchmark

Here is the result of make benchmark against ruby 2.1.1p2, 2.0.0p376, 1.9.3p488, and ours (trunk r44485 + our patch); all compiled from source with the same compiler (clang 3.4) and the same options.

The results of the proposed approach are very similar to 2.1.1p2; it seems they are virtually identical. But our approach is the fastest in most cases. Most benchmarks are a few % faster than 2.1.1p2, which is typical of cache optimizations. Several cases gain more speedup because of the side effect mentioned above, most notably bm_vm3_gc, which in fact spent most of its time around hash allocations.

Results of make rdoc

To measure the impact of our approach on a real-world program, make rdoc XRUBY=${target} was tested against the set of ruby versions mentioned above. However, 1.9.3 was not able to make it through, due to the versions RDoc supports, so it is not on the graph.

Here again, our approach is the fastest. RDoc is almost 100% pure ruby, so the hash optimizations benefit it very much.

Memory usage

One key concern about our approach is the increased size of objects, which can impact the memory footprint. At the same time, many small memory regions (notably hashes) are now packed into their objects, which can partially offset the increase. How much memory is used in practice?

To measure memory usage I used Valgrind's memory profiler, which can profile memory usage in a non-disruptive manner. Here are the memory footprints of rdoc generation:


The two graphs above are the memory profiles of ours and of 2.1.1p2. They show that the memory footprint does blow up. However, I also have to note that the blow-up ratio is 260.5 / 207.4 == 1.256, which is below what would be expected from the object size alone (8 words / 5 words == 1.6 times bigger). If you take a closer look, you can see that the amount acquired by heap_assign_page() increases while that of objspace_xmalloc() decreases. This means some memory that was originally allocated externally is now embedded into objects, which clearly demonstrates the side effect we expected.

Cache profile

Valgrind can also profile cache misses. However, this feature is ultra slow and not suitable for real-world programs like RDoc. So I tested ruby --disable-gems -e"0x400000.times { Object.new }". The output below is the cache-miss delta between ours and 2.1.1p2.
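
For reference, the two cachegrind data files compared below would come from runs along these lines, once per interpreter (a sketch; output file names are assumptions):

zsh % valgrind --tool=cachegrind ./ruby --disable-gems -e'0x400000.times { Object.new }'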

zsh % cg_diff cachegrind.out.cachelined cachegrind.out.211p2 | cg_annotate /dev/stdin
--------------------------------------------------------------------------------
Files compared:   cachegrind.out.cachelined; cachegrind.out.211p2
Command:          ./ruby --disable-gems -e0x400000.times { Object.new }; /home/shyouhei/data/target/ruby_2_1/bin/ruby --disable-gems -e0x400000.times { Object.new }
Data file:        /dev/stdin
Events recorded:  Ir I1mr ILmr Dr D1mr DLmr Dw D1mw DLmw
Events shown:     Ir I1mr ILmr Dr D1mr DLmr Dw D1mw DLmw
Event sort order: Ir I1mr ILmr Dr D1mr DLmr Dw D1mw DLmw
Thresholds:       0.1 100 100 100 100 100 100 100 100
Include dirs:     
User annotated:   
Auto-annotation:  off

--------------------------------------------------------------------------------
        Ir  I1mr ILmr      Dr       D1mr DLmr      Dw    D1mw    DLmw 
--------------------------------------------------------------------------------
12,772,600 3,381  -36 742,954 -1,725,002    2 222,942 -30,220 -14,810  PROGRAM TOTALS

--------------------------------------------------------------------------------
       Ir  I1mr ILmr      Dr       D1mr DLmr        Dw    D1mw    DLmw  file:function
--------------------------------------------------------------------------------
8,490,879    13   -1 -65,345 -1,553,377    0  -102,802 -19,959       0  gc.c:gc_heap_lazy_sweep
1,435,366   -12    1 247,409     -1,476    0   144,589     436       0  gc.c:gc_mark
  938,264     4    1  42,615     -3,114    0    32,188    -328       0  gc.c:gc_marks_body
  732,366   -23   -2 180,170     23,350    0    91,665      22       0  gc.c:gc_mark_children
  544,747    -1    2  72,222        -68    0     1,826       0       0  gc.c:mark_current_machine_context
  516,747   -11    0 167,970     21,331    0    54,456      59       0  st.c:st_foreach
  226,596  -368    0  78,816     10,844    0    39,408    -353       0  gc.c:mark_method_entry_i
 -204,979   -33    1 -45,053   -152,686    0   -38,672  12,178       2  gc.c:newobj_of
  119,003  -365    0  36,875      5,999    0     3,186       0       0  st.c:st_foreach_check
 -100,022   -93    1 -11,628       -556    0   -37,545 -15,310 -14,791  gc.c:heap_assign_page
  -95,217   -37    0 -34,943       -125    0         0       0       0  gc.c:rb_gc_mark_maybe
   76,464     0    0  33,453      2,771    0    23,895       0       0  gc.c:mark_const_entry_i
   34,515     5    0  10,620          7    0     5,310       0       0  hash.c:foreach_safe_i
   29,571    21    0   3,942        296    0     4,854       0       0  gc.c:rb_gc_mark_locations
   29,205     0    0  15,930      1,177    0     5,310       0       0  variable.c:mark_global_entry
   27,317   -42    0   4,897        708    0       944       0       0  vm.c:rb_vm_mark
   22,656    -1    0   7,552          0    0         0       0       0  gc.c:rb_gc_mark
  -20,253     0    0  -2,421     -7,125    0       -27       1       0  gc.c:rb_gc_call_finalizer_at_exit
  -19,602    21    0  -4,340     -1,669    0    -5,874  -6,707       0  gc.c:garbage_collect_body
  -18,335   134    0  -9,156         -7    0    -3,061      -3       0  parse.y:rb_intern2
  -18,131   -33   -2  -5,163       -229    0    -2,464     -44       3  /build/buildd/eglibc-2.17/malloc/malloc.c:_int_free
  -16,541    60    0  -2,080       -281    0    -2,300     -65      83  /build/buildd/eglibc-2.17/malloc/malloc.c:_int_malloc
   16,284     8    0   8,142          0    0     5,428      -1       0  gc.c:mark_entry
   14,809  -377    0   4,189        881    0     2,773     118       0  vm.c:rb_thread_mark
   14,342  -119   -1   3,611        474    0     1,892      59       0  gc.c:gc_mark_roots

It shows that data reads ("Dr") increase while data cache misses ("D1mr") decrease. This agrees very well with the memory profile: the footprint increases, so the data read count goes up, while cache misses decrease as designed.

Conclusion

Proposed is a way to reduce cache misses in ruby's object system. It speeds up both benchmarks and real-world programs, at the cost of a slightly larger memory footprint.

UPDATE: charts updated as I recompiled ruby using clang 3.4.

shyouhei added some commits Nov 23, 2013
@shyouhei shyouhei flexible object size
This change introduces struct RValueStorage, so that objects can
be any compile-time-defined width.

Note, however, that some objects embed others by nature, and so
they need length info packed into their flags.  Most are OK, but
Strings have no space left in their flags for wider buffers
(because they also embed the encoding index there).  To cover
that we needed to use the upper half of the flag bits.  If there
is no such thing as an upper half on your machine, Strings cannot
use the full amount of space they are promised to have.
c558ad4
@shyouhei shyouhei add STATIC_ASSERT
Now that embed lengths are computed automatically, it is not as
obvious as before whether the calculated lengths are OK.  We are
wiser to machine-check them.
c084b81
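
For illustration, a C compile-time assertion can be as small as the following sketch (a hypothetical macro; the actual STATIC_ASSERT added here may differ):

/* declares a valid typedef when `expr` holds, and a negative-size
 * array (a compile error) when it does not */
#define STATIC_ASSERT(name, expr) \
    typedef char static_assert_##name[(expr) ? 1 : -1]

/* e.g. refuse to compile unless an 8-word slot fills a 64-byte
 * cache line (true on machines with 8-byte words) */
STATIC_ASSERT(slot_is_one_cache_line, sizeof(void *) * 8 == 64);
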
@shyouhei shyouhei introducing new struct REmbedHash
Now that there is a bunch of room in each object, let's utilize
a Hash's unused area by storing its contents right there.
This should speed up creating small hashes like { foo: 'bar' }.
7a5c847
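
The layout could look roughly like this (a sketch under the 8-word assumption; field names are illustrative, not the patch's actual definition):

struct REmbedHash {
    struct RBasic basic;  /* 2 words: flags + klass      */
    VALUE keys[3];        /* up to 3 embedded keys...    */
    VALUE values[3];      /* ...paired with their values */
};                        /* 8 words total               */
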
@shyouhei shyouhei array.c needs care about REmbedHash.
Why not just call rb_hash_clear?  That should avoid touching hash
internals from array.c.
8c735ca
@shyouhei shyouhei our GC have to know about REmbedHash.
Now that T_HASH comes in both embedded and non-embedded styles,
the GC cannot blindly assume there is an st_table for every hash.
02ad7c6
@shyouhei shyouhei RHASH_SIZE() is much more complicated now.
I could have moved this inline function into ruby.h, but the sad
news is that a PROPER implementation of RHASH_SIZE needs st.h, and
ruby.h is not meant to include it.  The original implementation
magically avoided this by deferring access to the struct using the
preprocessor.
7682e52
@shyouhei shyouhei .gdbinit support for REmbedHash 7afe2d3
@shyouhei shyouhei new hash instances are now default embedding
The technique used here is that GCC (and clang) implements a much
richer literal syntax than standard C, so a single assignment can
express much more than you might think.  Compare the two #ifdef
parts and see the reduced cyclomatic complexity of the GCC part,
which means more room for optimizations.
6a9c00f
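
The "richer literal syntax" presumably refers to compound literals with designated initializers, along these lines (illustrative types, not the patch's code):

struct pair { int key; int value; };

static void
reset_pair(struct pair *p)
{
    /* one assignment initializes every member: members not
     * mentioned become zero; no memset, no per-field statements */
    *p = (struct pair){ .key = 0 };
}
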
@shyouhei shyouhei implement Hash#default=
Hashes that have dedicated defaults cannot embed someone else,
except when that default is a nil.  This implementation is
intuitive I believe.
97d25a3
@shyouhei shyouhei implement Hash#[]
A hash that embeds its contents has a fixed maximum size, so the
search ends in amortized O(1) time, which does not change the
complexity order of a generic Hash.
dc9a4c4
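
A linear scan bounded by a compile-time constant is O(1). A sketch of such a lookup (hypothetical helper, assuming the REmbedHash layout sketched above and a ruby.h context):

static VALUE
embed_lookup(const struct REmbedHash *h, VALUE key, long len)
{
    long i;
    for (i = 0; i < len; i++) {      /* len <= 3, a constant bound */
        if (rb_eql(h->keys[i], key)) /* eql?-based comparison      */
            return h->values[i];
    }
    return Qundef;                   /* not found; caller handles default */
}
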
@shyouhei shyouhei implement Hash#[]=
If there is room in the hash, use it.  If not, first convert it
into an ordinary hash and then insert into the st_table.
6f20651
@shyouhei shyouhei rb_hash_modify() to automatically explode
To reduce the patch size, I fall back to exploding whenever a
modification (other than rb_hash_aset) is performed.  This should
have no effect except some performance degradation.
e97f464
@shyouhei shyouhei implement Hash#fetch 08f5afb
@shyouhei shyouhei implement Hash#==
This could have been implemented more efficiently.  But to
minimize the patch for the sake of the pull request, I took a
simpler, much less efficient way.  This is to be fixed after this
branch lands.
3319e0a
@shyouhei shyouhei implement Hash#clear eeaada5
@shyouhei shyouhei implement Hash#dup
I hope this is OK... isn't it?
e23efd3
@shyouhei shyouhei implement Hash#keys f7f13ca
@shyouhei shyouhei implement Hash.[]
This was a bit tricky.  The argument hash may or may not be
embedded, and both cases must be considered.
968014d
@shyouhei shyouhei implement Hash#each
All operations that need iteration over a hash are hereby
considered too complex for embedded hashes.  This is less
efficient than ideal, but it cuts many lines from the patch, and
works for me anyway.
15d1f2d
@shyouhei shyouhei implement Hash#delete
A hash that embeds its contents does not permit "holes" inside
its body, so a deletion needs to rearrange the remaining contents.
This is much like the situation in Array#delete_at.
3ce088f
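
Concretely, the compaction could look like this sketch (a hypothetical helper; VALUE as in ruby.h):

static void
embed_delete_at(VALUE *keys, VALUE *values, long i, long len)
{
    long j;
    /* shift trailing pairs down so no hole remains,
     * just as Array#delete_at shifts elements */
    for (j = i; j < len - 1; j++) {
        keys[j]   = keys[j + 1];
        values[j] = values[j + 1];
    }
}
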
@shyouhei shyouhei implement Hash#to_h
The part in question is in fact not for Hash#to_h itself, but for
its child classes to share this method's implementation.  In that
case, if self is embedded, the returned object should be too.
Conversely, if self is an ordinary hash, rb_hash_new() generates
an embedded one by default, which does not suit; so we explode it.
7bb04ef
@shyouhei shyouhei implement Hash#has_key?
This is intuitive.
f70a7cb
@shyouhei shyouhei implement Hash#compare_by_identity?
It is false for embedded hashes.
c05d4f5
@shyouhei shyouhei implement Hash#assoc dc2c3bb
@shyouhei shyouhei implement Hash#values
Same as Hash#keys.
9035427
@shyouhei shyouhei support marshalling REmbedHash
I decided not to introduce incompatibilities into the marshal
format.  Hashes are always exploded before marshalling, and always
imploded while unmarshalling.  This shall have no impact, except a
slight performance slowdown.
20d0af4
@shyouhei shyouhei Thread local storage REmbedHash support
I don't know why, but Ruby's TLS uses Ruby-level Hashes while at
the same time directly touching hash internals.  Performance
reasons?  Or some esoteric reason to avoid rb_funcall?  Anyway, we
have to deal with REmbedHash here as well.
2080583
@shyouhei shyouhei implement Hash#initialize_copy
Copying a hash should definitely be efficient.  I took the
strategy that any copy of an embedded hash should itself be
embedded.
6979867
@shyouhei shyouhei implement Array#|
I find it hard to implement this method without exploding.  It
directly uses st_update, and I think that is reasonable.
dcfe25c
@shyouhei shyouhei fuck rb_obj_clone()
That function blindly clobbers the flags of generated objects.
We need to take care of that.
869832d
@shyouhei shyouhei implement Hash#rehash e829480
@shyouhei shyouhei implement Hash#replace 8a7569b
@shyouhei shyouhei implement Hash#shift 8a90644
@shyouhei shyouhei use L1 cache coherency line size
The size of RValueStorage can be anything; ultimately, setting it
to either 1 or 4096 should break nothing.  But to boost
performance, it is wiser to fit a single RValueStorage into a
single cache line.
c411653
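
On Linux/glibc, one way to discover the line size at runtime is sysconf (a sketch; a build would more likely probe this at configure time, and _SC_LEVEL1_DCACHE_LINESIZE is glibc-specific):

#include <stdio.h>
#include <unistd.h>

int main(void)
{
    /* L1 data cache line size in bytes; may be 0 or -1 if unknown */
    long line = sysconf(_SC_LEVEL1_DCACHE_LINESIZE);
    printf("L1 dcache line: %ld bytes\n", line);
    return 0;
}
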
@shyouhei shyouhei eliminate cache misshit
Our object space consists of homogeneous arrays of RVALUEs.  So
by aligning the head of those arrays, we make every RVALUE
cache-aligned.
e34c0ee
@shyouhei shyouhei implement Hash#select! 870734d
@shyouhei shyouhei RGenGC awareness
Hashes are no longer shady by default.  This should boost things up.
6954c09
@shyouhei
Member
shyouhei commented Jan 3, 2014

As a committer I could have merged this as-is, but I wanted to request your comments, because this patch has an obvious drawback (binary breakage).

shyouhei added some commits Jan 3, 2014
@shyouhei shyouhei dodge Travis' complaints about static functions
Though I believe it is legal to declare a function inside of a function.
e5ed75d
@shyouhei shyouhei Merge e5ed75d into 8f04556 67a86ca
@vendethiel

Great job 👍.

Might that need a raise in ruby's malloc threshold?

@shyouhei
Member
shyouhei commented Jan 3, 2014

@Nami-Doc Could be. But I'm not sure whether raising it would boost things (because objects are now larger) or lowering it would (because there are fewer mallocs).

@shyouhei
Member
shyouhei commented Jan 3, 2014
Eric Wong added some commits Jan 4, 2014
Eric Wong array.c: fix typo abecfa9
Eric Wong ruby.h: use 6-bits for embedded array/struct length
This is to be compatible with 32-bit systems, where
sizeof(VALUE) == 4 and the cache line size is 64 bytes.
d8bc0d2
Eric Wong test_set_len: update test to account for longer embedded strings fd6cae0
Eric Wong test_objspace: increase string size for sharing
Embedded strings got bigger, so we must use bigger strings if we
want to test sharing.
39448cb
Eric Wong hash: fix RHASH_IFNONE
Embedded hashes have .ifnone == Qnil.
c1c4455
Eric Wong test_hash: bump up hash size to force rehashing
Embedded hashes are not really hashes, so they cannot be rehashed.
0686d7d
@infogulch

Interesting. Have you tried 6-word objects? This would fit 4 objects into 3 cache lines, and 2/3 of the cache lines would contain complete objects, but it would only increase the size by 1/5 instead of 3/5. Are you sure your 3/8 figure for arrangements holding a complete object is right? It looks like 4/5 to me at first glance. (From HN)

@fleitz
fleitz commented Jan 4, 2014

Cache line size is generally 64 bytes on x86, 8*8=64 so each cache line contains exactly 8 objects. This assumes a contiguous array of objects initially cache line aligned.

@infogulch

@fleitz Re-read the pull request. Right now, ruby objects take 5 words, not 5 bytes. This proposal makes ruby objects take 8 words, i.e. one whole cache line per object (assuming 8-byte words and 64-byte cache lines).

Eric Wong and others added some commits Jan 4, 2014
Eric Wong string.c: clear old flags when becoming embedded
We no longer overload the shared/assoc flags for embedded
strings 32 bytes or longer, so we cannot rely on setting the
embedded length to clear the shared/assoc flags.

Thus, a string which goes from:
	(1)no-embed -> (2)embed -> (3)no-embed
may inherit false shared/assoc flags from the original noembed form,
leading to assertion failures and segfaults.
87f1302
@shyouhei shyouhei Merge branch 'pull-495-fixes' of git://80x24.org/ruby into cachelined 814d988
@shyouhei shyouhei fix another typo 7f7b123
@shyouhei
Member
shyouhei commented Jan 4, 2014

OK, so I found @infogulch's comment interesting. I ran RDoc while changing the object size one word at a time, and got this chart.

To obtain time and memory at once I used time make rdoc…, so the measurements differ from my previous experiments, which used valgrind.

Observations:

  • The 5-word case, which is also the object size 2.1 uses, is still a bit faster than 2.1.
    • This is because our patch embeds contents more aggressively than vanilla 2.1.
  • Memory usage increases monotonically with the object size.
  • 8 words is the fastest.
  • 7 words is (surprisingly) faster than 6 words, but only slightly.
  • Anything bigger than 8 words seems to be of no use.

I see 6 words as another good point in the memory-time tradeoff. It uses almost the same amount of memory as before, and is much faster.

Eric Wong and others added some commits Jan 5, 2014
Eric Wong hash.c: do not explode on Hash#hash
This fixes the exploding of recursive hashes: inserting a hash
into itself would trigger an explode and lead to a corrupted hash
and wasted memory.
4e4aa22
@shyouhei shyouhei Merge branch 'pull-495-fixes' of git://80x24.org/ruby into cachelined fe8820a
@rurounijones

Would it be worth making the word size a compile-time flag, with a default of 6 maybe (assuming this is possible)? Then, in cases where the extra speed is super important and worth the memory increase (heavy-duty processing, for example), the option to use 8 words would be available.

Eric Wong hash: fix GC crash during Hash#rehash
The temporary hash has the embed flag set, so if GC attempts to
mark it, it can clobber ntbl.  For safety's sake, use a regular
hash and explode it before giving it ->ntbl.
9d00d05
@shyouhei
Member
shyouhei commented Jan 6, 2014

@rurounijones Making the object size configurable (which is in fact technically possible) confuses extension libraries. Different object sizes are of course binary incompatible, so end-users would encounter mysterious SEGVs from corrupted objects, which are very hard to tackle. I'd prefer everyone to use one good default.

It's completely OK with me to choose 6.

@shyouhei
Member

CF: [ruby-core:60784]

In the latest ruby-core meetup we agreed to reject this particular patch because it consumes too much memory, so I am closing this now. I believe the concept itself is still valid, though. Let me brush things up.

Look forward to my second try.

@shyouhei shyouhei closed this Feb 17, 2014