
[WIP] add idle rax (with benchmarks) #7990

Draft · wants to merge 1 commit into unstable

Conversation

@johncantrell97 commented Oct 29, 2020

As discussed with @guybe7 in #7973

This is a test of adding a new rax for consumer groups and consumers that stores a PEL indexed on each entry's delivery_time. This will allow a more efficient and easier-to-use implementation of XAUTOCLAIM.
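
For reference, here is a minimal sketch of how such a key could be laid out, assuming a composite rax key of the big-endian delivery time followed by the big-endian stream ID so that keys sort by delivery time first (the helper name and exact layout are illustrative, not the actual PR code):

```c
#include <stdint.h>

/* Illustrative only: build a 24-byte rax key from the big-endian
 * delivery_time (8 bytes) followed by the big-endian stream ID
 * (ms + seq, 16 bytes). Big-endian encoding makes the lexicographic
 * order of the keys match the numeric order of delivery times. */
static void idlePelKey(unsigned char key[24], uint64_t delivery_time,
                       uint64_t id_ms, uint64_t id_seq) {
    uint64_t fields[3] = {delivery_time, id_ms, id_seq};
    for (int f = 0; f < 3; f++)
        for (int b = 0; b < 8; b++)
            key[f*8 + b] = (unsigned char)(fields[f] >> (8 * (7 - b)));
}
```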

This test adds 1,000,000 entries to a stream, where each entry has just a single field/value pair made of the strings "field" and "value", added via:

redis.xadd(["mystream",'*','field','value'])

They are then all consumed by a single consumer using XREADGROUP.

I ran the benchmark a couple of times on each of the unstable and idle-pel branches and the numbers are stable. I used /proc/<pid>/status to measure memory usage.

Before Running Benchmark

VmPeak:    53672 kB
VmSize:    53672 kB
VmHWM:     10104 kB
VmRSS:     10104 kB
RssAnon:         6480 kB
RssFile:         3624 kB
VmData:    43452 kB
VmPTE:        88 kB

UNSTABLE RESULTS

VmPeak:   146856 kB
VmSize:   146856 kB
VmHWM:    102152 kB
VmRSS:    102152 kB
RssAnon:           98080 kB
RssFile:            4072 kB
VmData:   136636 kB
VmPTE:       272 kB

IDLE-PEL RESULTS

VmPeak:   226728 kB
VmSize:   226728 kB
VmHWM:    155016 kB
VmRSS:    155016 kB
RssAnon:          151024 kB
RssFile:            3992 kB
VmData:   216508 kB
VmPTE:       376 kB

If this idea is considered viable, there is still more work to do on the PR to get it ready to merge, but I think the changes are enough to generate a valid benchmark.

@guybe7 (Collaborator) commented Oct 29, 2020

@johncantrell97 thanks! I will go over the results and keep you posted.

@oranagra (Member)

@johncantrell97 thanks for looking into this.
It seems like some 50% overhead, which is more than I expected, even for these short field+value pairs.
A few random thoughts:

  1. Looking at the /proc metrics isn't valuable here; both VmSize and RSS count things that are not really relevant. I would have suggested using MEMORY USAGE on that key, but that's just an approximation, and since you literally don't have anything else in the database, used_memory from INFO MEMORY is what you actually want.
  2. I didn't read the whole correspondence, but I did discuss it with Guy, so I hope I'm not mistaken. IIUC we actually just need to sort them by time, and the only reason we add the entry ID is to avoid collisions; considering that the rax tree divergence will happen in the idle time, the entry ID part of the key is just sitting there for each and every entry. So an alternative would be to add the pointer to the PEL record as the suffix of the rax key name (instead of the entry ID); it serves the same purpose (still being unique) and is a bit shorter (8 bytes instead of 16). What can be neat about that is that we can then store NULL as the value and get the PEL pointer from the key name (no need for big-endian, btw), and there's a neat feature of rax which avoids wasting memory on the value if it is NULL (using one bit in the rax tree record). A rough sketch of this variant follows after this list.
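
To make the suggestion above concrete, here is a rough sketch of what that variant could look like against the existing rax API (raxInsert is the real rax function; the 16-byte key layout and the helper name are just assumptions for illustration, and it assumes a 64-bit build):

```c
#include <stdint.h>
#include <string.h>
#include "rax.h"   /* Redis radix tree API: rax, raxInsert(), ... */

/* Illustrative only: 16-byte key = big-endian delivery_time (so entries
 * sort by idle time) followed by the raw streamNACK pointer, which is
 * only there to keep keys unique. The stored value is NULL, letting the
 * rax use its "NULL value" bit and skip allocating a data pointer. */
static void idlePelAdd(rax *idle_pel, uint64_t delivery_time, void *nack) {
    unsigned char key[16];
    for (int b = 0; b < 8; b++)
        key[b] = (unsigned char)(delivery_time >> (8 * (7 - b)));
    memcpy(key + 8, &nack, sizeof(nack)); /* pointer suffix; 8 bytes on 64-bit */
    raxInsert(idle_pel, key, sizeof(key), NULL, NULL);
}
```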

P.S. Converting this PR to a draft (which it is), in order to hide the test failures due to build warnings. When it's time for it to be tested, feel free to convert it back.

@oranagra marked this pull request as draft on October 29, 2020, 19:18
@johncantrell97 (Author)

Thanks for the feedback! I wasn't really sure of the best way to measure memory usage here. I was also surprised by the overhead; shouldn't the fact that I used short field+value pairs not really matter, since the data itself isn't duplicated, only the reference to the entry?

  1. I'll re-run these with the metrics from Redis directly, using the methods you suggested.
  2. That is correct, we just want entries sorted by time. Wow, that sounds like a really cool optimization. I'll give it a shot!

Thanks. The test failures from the warnings are just because I have unused variables in some unfinished code where we handle insertion failures. I wasn't going to bother thinking it through if the idea gets thrown away.

@oranagra (Member)

Yeah, it's just sloppy draft coding to test an idea. I see the errors are still here though (bothering my eyes).

Regarding the lengths of the fields and values: we're measuring the % increase of memory usage (we have to compare it to something), so if you had more fields, or longer values, the % increase would be lower.

@johncantrell97 (Author)

So I tried to make the optimization you mentioned, though I'm not a C programmer, so I'm not sure this is exactly how to do it. It definitely doesn't compile on 32-bit machines because I'm casting the pointer to a uint64, so I'd have to fix that code if it's even the right approach.
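
As a side note on the 32-bit issue, the usual portable way to push a pointer through an integer is uintptr_t (or a memcpy of the pointer itself) rather than a hard uint64 cast; a tiny sketch, not the PR code:

```c
#include <stdint.h>
#include <string.h>

/* Illustrative only: copy a pointer into a key buffer without assuming
 * pointers are 64 bits wide. This compiles cleanly on 32-bit and 64-bit
 * builds; the copied width is 4 bytes on 32-bit, 8 bytes on 64-bit. */
static size_t putPointer(unsigned char *dst, void *p) {
    uintptr_t v = (uintptr_t)p;   /* integer type as wide as a pointer */
    memcpy(dst, &v, sizeof(v));
    return sizeof(v);             /* number of key bytes consumed */
}
```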

Regardless, the key is now only 128 bits total and the value is NULL, yet the size increase is still large for some reason. Using INFO MEMORY now, though:

used_memory:183852184
used_memory_human:175.34M
used_memory_rss:195653632
used_memory_rss_human:186.59M
used_memory_peak:183977032
used_memory_peak_human:175.45M
used_memory_peak_perc:99.93%
used_memory_overhead:545768
used_memory_startup:525192
used_memory_dataset:183306416
used_memory_dataset_perc:99.99%
allocator_allocated:183962640
allocator_active:184225792
allocator_resident:191213568
total_system_memory:53840658432
total_system_memory_human:50.14G
used_memory_lua:37888
used_memory_lua_human:37.00K
used_memory_scripts:0
used_memory_scripts_human:0B
number_of_cached_scripts:0
maxmemory:0
maxmemory_human:0B
maxmemory_policy:noeviction
allocator_frag_ratio:1.00
allocator_frag_bytes:263152
allocator_rss_ratio:1.04
allocator_rss_bytes:6987776
rss_overhead_ratio:1.02
rss_overhead_bytes:4440064
mem_fragmentation_ratio:1.06
mem_fragmentation_bytes:11842464
mem_not_counted_for_evict:0
mem_replication_backlog:0
mem_clients_slaves:0
mem_clients_normal:20504
mem_aof_buffer:0
mem_allocator:jemalloc-5.1.0
active_defrag_running:0
lazyfree_pending_objects:0

used_memory:94070592
used_memory_human:89.71M
used_memory_rss:104472576
used_memory_rss_human:99.63M
used_memory_peak:94195440
used_memory_peak_human:89.83M
used_memory_peak_perc:99.87%
used_memory_overhead:544744
used_memory_startup:524168
used_memory_dataset:93525848
used_memory_dataset_perc:99.98%
allocator_allocated:94177512
allocator_active:94474240
allocator_resident:100143104
total_system_memory:53840658432
total_system_memory_human:50.14G
used_memory_lua:37888
used_memory_lua_human:37.00K
used_memory_scripts:0
used_memory_scripts_human:0B
number_of_cached_scripts:0
maxmemory:0
maxmemory_human:0B
maxmemory_policy:noeviction
allocator_frag_ratio:1.00
allocator_frag_bytes:296728
allocator_rss_ratio:1.06
allocator_rss_bytes:5668864
rss_overhead_ratio:1.04
rss_overhead_bytes:4329472
mem_fragmentation_ratio:1.11
mem_fragmentation_bytes:10443000
mem_not_counted_for_evict:0
mem_replication_backlog:0
mem_clients_slaves:0
mem_clients_normal:20504
mem_aof_buffer:0
mem_allocator:jemalloc-5.1.0
active_defrag_running:0
lazyfree_pending_objects:0

@oranagra (Member) commented Nov 1, 2020

@johncantrell97 I'm not sure which tests the measurements you posted refer to.

I see that in your initial test we had 102MB RSS with unstable, and 155MB with the modified version (which had a 24-byte key and an 8-byte value).
Now you've posted two new tests: one with 89MB used_memory and 99MB RSS (is that unstable?), and one with 175MB used_memory and 186MB RSS (I assume that's the modified code).
Which means the modification I suggested (which reduced the bytes we keep) actually caused an increase in memory usage?

Looking at the code it seems right (16-byte key and 0-byte value). One explanation might be that the rax tree is more diverged now, and the only way for that to happen is if the bytes at the beginning of the key (the delivery_time) are all very similar to each other, so the tree isn't diverging on that part of the key; when using the stream ID the suffixes were also similar to each other, but now, with the pointer, they diverge more.

If that's the case, the first question we should be asking ourselves is whether this test is representative (so many records with very similar delivery_times).

Maybe we need to scale down this test and use raxRecursiveShow to try to figure out what's really going on inside the rax.
If we plan it carefully, I have a feeling we can reduce the extra cost to something rather small.
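
For what it's worth, rax.h exposes raxShow() as the debugging entry point (raxRecursiveShow() is the internal recursive helper it calls), so a scaled-down experiment could look roughly like this (the keys here are illustrative, not the PR code):

```c
#include <string.h>
#include "rax.h"   /* raxNew(), raxInsert(), raxShow(), raxFree() */

int main(void) {
    rax *t = raxNew();
    /* Insert a handful of keys that share a long common prefix (similar
     * delivery times) and differ only in their suffix, then dump the
     * tree to see where it actually diverges. */
    const char *keys[] = {"16041234560001", "16041234560002", "16041234567890"};
    for (size_t i = 0; i < sizeof(keys)/sizeof(keys[0]); i++)
        raxInsert(t, (unsigned char *)keys[i], strlen(keys[i]), NULL, NULL);
    raxShow(t);   /* prints the node structure of the radix tree */
    raxFree(t);
    return 0;
}
```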

Another theory is that maybe we have a bug that goes undetected because we didn't assert on the return value of raxRemove? (Not sure if the test you're doing is reaching that code.)

Regarding 32-bit, don't worry about it for now; we'll solve it later, and also avoid storing the extra 4 bytes in that case.

@guybe7 (Collaborator) commented Nov 2, 2020

There are two issues with the "pointer-in-the-key" approach:

  1. if the rax is defragged we have to go over the idle PEL and update the pointers
  2. pointers tend to be somewhat random, so we don't enjoy the "clustering" rax provides for keys with identical prefixes (this could explain why this approach was actually more memory-consuming even though the key is shorter)

We have two options:

  1. we go with the original idea (rax key is delivery_time + streamID); maybe when the stream is a bit more realistic (i.e. a key of 32 bytes and a value of 128 bytes) the overhead is not so bad? @johncantrell97 could you please run your test with this configuration?
  2. we could eliminate delivery_time from the NACK structure. Given a streamID we can search for it in the idle PEL to get the delivery time (scan with start=0|, end=MAXINT|). This might reduce the overhead caused by the idle PEL but will result in extra searching time (e.g. for XPENDING). See the sketch after this list.
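
If I'm reading option 2 correctly, looking up a stream ID would mean walking the delivery-time-ordered rax until the key suffix matches; here is a rough sketch under that assumption, using the existing rax iterator API (the 24-byte key layout is assumed, not taken from the PR):

```c
#include <stdint.h>
#include <string.h>
#include "rax.h"   /* raxStart(), raxSeek(), raxNext(), raxStop() */

/* Illustrative only: find the delivery_time of a given stream ID in an
 * idle PEL whose 24-byte keys are the big-endian delivery_time (8 bytes)
 * followed by the big-endian stream ID (16 bytes). Since the stream ID
 * is the suffix, we cannot seek to it directly and must scan. */
static int idlePelFindDeliveryTime(rax *idle_pel, const unsigned char id[16],
                                   uint64_t *delivery_time) {
    raxIterator it;
    raxStart(&it, idle_pel);
    raxSeek(&it, "^", NULL, 0);              /* start from the smallest key */
    while (raxNext(&it)) {
        if (it.key_len == 24 && memcmp(it.key + 8, id, 16) == 0) {
            uint64_t dt = 0;
            for (int b = 0; b < 8; b++) dt = (dt << 8) | it.key[b];
            *delivery_time = dt;
            raxStop(&it);
            return 1;                        /* found */
        }
    }
    raxStop(&it);
    return 0;                                /* not found */
}
```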

@oranagra (Member) commented Nov 2, 2020

@guybe7 a few points from our discussion that you forgot to mention:

The two problems you mentioned above (random pointers, and defrag) are not the main problem (defrag can be easily resolved, and the pointers will be more similar if we store them in big-endian).
The main problem with the approach I suggested is that streamNACK doesn't have the stream ID in it.
So if we find a record in the delivery-time-sorted rax, and we don't store the stream ID in the rax key, we have no way to look it up in the other rax.
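
For context, this is roughly the streamNACK structure from stream.h (quoted approximately; it relies on mstime_t and streamConsumer from the Redis headers). Note that it carries the delivery time, delivery count and consumer, but no stream ID:

```c
/* Approximate copy of the definition in stream.h: the pending-entry record
 * that the per-group and per-consumer PELs point to, keyed by stream ID.
 * There is no streamID field here, which is why a record found through a
 * delivery-time-sorted rax can't be mapped back to the main PEL unless
 * the rax key itself carries the ID. */
typedef struct streamNACK {
    mstime_t delivery_time;    /* Last time this message was delivered. */
    uint64_t delivery_count;   /* Number of times this message was delivered. */
    streamConsumer *consumer;  /* Consumer this message was last delivered to. */
} streamNACK;
```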

Regarding further testing: apart from the short strings, what greatly affects the results of this test is that the XADD and XREADGROUP calls were both done very fast, so the stream IDs and delivery times are all very, very similar.
Maybe a more realistic test in that regard would lead to different conclusions.

