
sds size classes - memory optimization #2509

Closed

Conversation

@oranagra (Member) commented Apr 9, 2015

sds strings were using an 8-byte header; this has two implications:

  1. wasting memory - even on short strings (of less than 256 chars) we have a full 4-byte len field.
  2. we are not able to hold strings that are larger than 2^31 chars (see the issues with MIGRATE, e.g. "Remove 512 MB max value limit" #757).

With this patch, each sds instance can be one of 4 different internal types:

  • strings with 8-bit length - 3-byte header
  • strings with 16-bit length - 5-byte header
  • strings with 32-bit length - 9-byte header
  • strings with 64-bit length - 17-byte header

Test (on relatively short strings)
debug populate 10000000
used_memory of original code: 1254709872
used_memory with new code: 1078723024
memory optimization: 16%

*Notice that, considering internal fragmentation, there are some string lengths which will now fit into a smaller jemalloc bin; for these the result will be closer to a 2x factor of memory reduction.
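
For illustration, a minimal sketch of what the four headers could look like (the field names, the free field and the packed layout are assumptions, not necessarily the exact code of this patch); in each case the one-byte flags field sits right before the character buffer, so the type can be recovered from an sds pointer s by reading s[-1]:

    #include <stdint.h>

    struct __attribute__ ((__packed__)) sdshdr8 {   /* 3-byte header  */
        uint8_t len;          /* used length */
        uint8_t free;         /* free space left after the string */
        unsigned char flags;  /* size class in the low bits */
        char buf[];
    };
    struct __attribute__ ((__packed__)) sdshdr16 {  /* 5-byte header  */
        uint16_t len;
        uint16_t free;
        unsigned char flags;
        char buf[];
    };
    struct __attribute__ ((__packed__)) sdshdr32 {  /* 9-byte header  */
        uint32_t len;
        uint32_t free;
        unsigned char flags;
        char buf[];
    };
    struct __attribute__ ((__packed__)) sdshdr64 {  /* 17-byte header */
        uint64_t len;
        uint64_t free;
        unsigned char flags;
        char buf[];
    };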

@@ -119,7 +119,7 @@ REDIS_SERVER_NAME=redis-server
REDIS_SENTINEL_NAME=redis-sentinel
REDIS_SERVER_OBJ=adlist.o quicklist.o ae.o anet.o dict.o redis.o sds.o zmalloc.o lzf_c.o lzf_d.o pqsort.o zipmap.o sha1.o ziplist.o release.o networking.o util.o object.o db.o replication.o rdb.o t_string.o t_list.o t_set.o t_zset.o t_hash.o config.o aof.o pubsub.o multi.o debug.o sort.o intset.o syncio.o cluster.o crc16.o endianconv.o slowlog.o scripting.o bio.o rio.o rand.o memtest.o crc64.o bitops.o sentinel.o notify.o setproctitle.o blocked.o hyperloglog.o latency.o sparkline.o redis-check-rdb.o
REDIS_CLI_NAME=redis-cli
-REDIS_CLI_OBJ=anet.o sds.o adlist.o redis-cli.o zmalloc.o release.o anet.o ae.o crc64.o
+REDIS_CLI_OBJ=anet.o adlist.o redis-cli.o zmalloc.o release.o anet.o ae.o crc64.o
@oranagra (Member, Author) commented:

redis-cli uses hiredis (which comes with its own implementation of sds), no need for sds.o in redis-cli.

@antirez (Contributor) commented Apr 9, 2015

I'll have to review this carefully for implications, but I'm already a fan of this pull request: it not only provides memory benefits, it actually fixes the problem of being bound to 4 GB length in SDS strings, which was a problem with Redis Cluster when huge keys have to be serialized for MIGRATE. However, why are we using a 3/5/9/17 setup here? We could go for using the first two bits to store the class and have 2/4/8/16-byte headers. Note that this can be achieved easily using bitfields in a structure, with unions for each class.
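
For reference, a hypothetical sketch of that bitfield idea (illustration only, not code from this PR): the first two bits of the header word select the class, and the rest of the word holds the length:

    #include <stdint.h>

    struct sdshdr2bytes {   /* 2-byte header, strings up to 16383 chars */
        uint16_t type : 2;
        uint16_t len  : 14;
        char buf[];
    };

    struct sdshdr8bytes {   /* 8-byte header, strings up to 2^62-1 chars */
        uint64_t type : 2;
        uint64_t len  : 62;
        char buf[];
    };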

@oranagra (Member, Author) commented Apr 9, 2015

Hi,
Using bitfields will be problematic, since you'll need a full 64-bit integer that is not aligned, and that will cause many >>, | and & operations to happen (not efficient).
Also, I think C's support for bitfields will not compile that well, and quite a lot of manual, ugly code will be needed.

I thought of another approach: use the 2 lowest bits of the actual sds pointer (since malloc always returns 8-byte aligned addresses), but that would mean we'd need to change sds not to be a char*, and you'd need to call a macro every time you want to pass an sds to a function that takes a char* (e.g. printf). I figured you wouldn't like it very much... 8-(
Also, the reduction from an 8-byte header to 3 bytes is already quite a lot.
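
For the record, a hypothetical sketch of that rejected pointer-tagging approach (all names are invented for illustration; none of this is in the PR):

    #include <stdint.h>

    /* malloc() returns 8-byte aligned addresses, so the low bits of the
     * pointer itself could encode the size class; the price is that sds
     * can no longer be a plain char *, and every call that expects a C
     * string needs an unwrap macro. */
    typedef void *sds_tagged;

    #define SDS_TAG_MASK  ((uintptr_t)0x3)
    #define SDS_TAG(p, t) ((sds_tagged)((uintptr_t)(p) | (uintptr_t)(t)))
    #define SDS_CLASS(s)  ((int)((uintptr_t)(s) & SDS_TAG_MASK))
    #define SDS_STR(s)    ((char *)((uintptr_t)(s) & ~SDS_TAG_MASK))

    /* usage: printf("%s\n", SDS_STR(s)); instead of printf("%s\n", s); */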

One more thing: with this approach I have 6 spare bits in the header which I can use for a ref-counter.
Currently there's a mixup in Redis between robj and sds.
robj, which has lru and such, is mainly meant to hold objects (in dictionaries), but many places use an sds inside an robj just so that it can be ref-counted.
This lays the ground for using just sds (without the robj struct overhead).
Also, since sds.c (like dict.c) is a mini standalone library that can be used in other projects, adding a ref-count feature at no cost is nice.

Regarding the actual implementation:
the sds API hasn't changed (much); the implementation details of handling the different header types are kept inside sds.h and sds.c.
I do admit the code is somewhat less readable (ugly), but it should be as efficient as possible!

Last note: we have been using this code for quite some time now (with Redis 2.8), and it has been tested quite a lot.

@antirez (Contributor) commented Apr 9, 2015

Yep, there is a complexity tradeoff... Still, those 6 spare bits... argh. Well, maybe there is a good way to use them.

Currently you are using 1 byte for the encoding type, I guess, and only 2 bits are used for the different encodings. So given that most SDS strings are never resized, and that the smaller they are the less likely they are to get resized, we could use 3 bits instead. One additional encoding that you are not using could use the remaining 5 bits to directly state the length, without any other field, nor a free bytes field. If we need to resize, we directly promote to the next level (3-byte header).

This way strings up to 31 bytes can use a single-byte header, which sounds interesting.
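
A sketch of how that single-byte header could be packed (illustration of the idea only, not the final code): the 3 low bits of the flags byte stored at s[-1] hold the type, and the remaining 5 bits hold the length directly, so there is no len or free field at all:

    #define SDS_TYPE_BITS 3
    #define SDS_TYPE_MASK ((1 << SDS_TYPE_BITS) - 1)   /* 0x07 */

    /* pack/unpack the flags byte for the tiny class (length <= 31) */
    #define SDS_TINY_FLAGS(type, len) ((unsigned char)((type) | ((len) << SDS_TYPE_BITS)))
    #define SDS_TINY_LEN(flags)       (((unsigned char)(flags)) >> SDS_TYPE_BITS)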

Apart from that, there is a potential problem with this patch in general: is Redis expecting in some place that the sds buffer starts at an aligned address? The obvious candidate was bitops.c; at least there I apparently put a dirty check for this condition:

    /* Count initial bytes not aligned to 32 bit. */
    while((unsigned long)p & 3 && count) {
        bits += bitsinbyte[*p++];
        count--;
    }

However, I would run your modified code in an environment where unaligned accesses produce segfaults. Maybe valgrind is also able to check this?

@itamarhaber (Member) commented:

🌟

@oranagra (Member, Author) commented Apr 9, 2015

Nice idea about the fifth size class.
*Though then we lose the refcount (which isn't used at the moment anyway).

Want me to implement it and add it to the pull request?

I've tested it with valgrind (inside Redis 2.8) with the test suite.

Good idea to test with segfaults on unaligned access. I think* Linux has a /proc option for that (at least on ARM it does).
*Not near a PC at the moment.

@antirez (Contributor) commented Apr 9, 2015

ACK on everything, except that it's better to wait for me to properly review the PR before implementing more, since currently I'm blabbing here without having checked the actual data structures at all ;-) Thanks again, news soon.

@ahuntcirruspath commented:

+1

@antirez (Contributor) commented Jun 29, 2015

Don't worry, this is in line for the 3.2 merge roadmap :-) Work will start in the next days, after the GEO commands stuff.

@ptaoussanis commented:

Very happy to hear that! Super excited about the 3.2 roadmap btw, thanks so much for all your hard work Salvatore!

@antirez (Contributor) commented Jun 29, 2015

Thanks @ptaoussanis. This time we are also benefitting quite a lot from other coders' work :-) We have @mattsta's Geo stuff, @oranagra's memory optimizations (this and another one), @tmm1's AOF safety PR #2574, and more.

@antirez (Contributor) commented Jul 14, 2015

I completed a first initial review, that is, I just read the code line by line.
First impressions:

  1. The code looks sane.
  2. I would remove the refcount thing and leave the unused bits for something else, maybe for an additional encoding for very small strings.
  3. I wonder if we ever handle sds->buf as something with special alignment, like for example inside bitops.c. I hope it's not the case, but it is worth checking with an architecture that segfaults on unaligned access. Or maybe there is a way to make unaligned accesses trap on x86 as well.

Now the next step: I merged the code locally and I'm going to do some tests, plus a bit more review, changing a few function names and adding a few macros which I think can improve code readability. I'll only work on this PR until it is merged or until I find a major issue to report here. So far so good, and thanks!

@oranagra (Member, Author) commented:

I vote in favor of the tiny size class (SDS_TYPE_6). It will save one more byte in many, many cases.
This one byte might in some cases be the breakeven point to move to a lower heap allocator pool and save 16 bytes of internal fragmentation (the next pool size in jemalloc after 64 is 80).
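
To make that breakeven concrete (assuming jemalloc's usual small bins of ..., 32, 48, 64, 80, ...): a 30-char string needs 30+3+1 = 34 bytes with a 3-byte header and lands in the 48-byte bin, but only 30+1+1 = 32 bytes with a 1-byte header, which fits the 32-byte bin exactly and saves 16 bytes of internal fragmentation.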

I looked at the code in bitops.c and it looks safe.
You could test on ARM, but remember to kill the kernel fixups by doing:
echo 5 | sudo tee /proc/cpu/alignment
I don't see anything like that on x86 kernels.

antirez added a commit that referenced this pull request Jul 15, 2015
This is an attempt to use the refcount feature of the sds.c fork
provided in the Pull Request #2509. A new type, SDS_TYPE_5 is introduced
having a one byte header with just the string length, without
information about the available additional length at the end of the
string (this means that sdsMakeRoomFor() will be required each time
we want to append something, since the string will always report to have
0 bytes available).

More work is needed in order to avoid that common SDS functions pay the
cost of this type. For example, both sdscatprintf() and sdscatfmt()
should try to upgrade to SDS_TYPE_8 ASAP when appending chars.
@antirez (Contributor) commented Jul 15, 2015

Hello my now-colleague @oranagra :-) I tried to implement TYPE 5, your comments are welcome. Now testing the difference it makes. Of course this whole PR benefits a lot from the other one where there is a new size class for jemalloc...

@antirez (Contributor) commented Jul 15, 2015

Forgot to mention that, btw, even if some complexity is added, there are use cases with ~10% to ~20% memory benefits, so consider this merged... just give me the time to see if this can be improved in some way.

@oranagra (Member, Author) commented:

I reviewed the code (see one comment inside the code).

I didn't realize (until now) that we don't have space for the 'alloc' field,
so our options are either:

  1. 2 bits for length and 2 bits for alloc (which is useless).
  2. have sdsavail() report 0, and then the string is realloc'ed every time it grows.
  3. use zmalloc_size() every time to get the true size of the allocation (which we don't store).

In fact, when thinking about it, the realloc will be a no-op if the allocation doesn't jump from one allocator bin to another (it won't do allocate+memcpy+free, it will just return the same pointer),
so options 2 and 3 are in fact almost the same.
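
A rough sketch of what option 2 means in practice (illustration only, reusing the hypothetical sdshdr8 layout and the SDS_TYPE_* constants sketched earlier in this thread; the type tag values are assumed):

    #include <stddef.h>

    #define SDS_TYPE_5 0   /* assumed tag values, for illustration */
    #define SDS_TYPE_8 1

    static inline size_t sdsavail(const char *s) {
        unsigned char flags = (unsigned char)s[-1];
        switch (flags & SDS_TYPE_MASK) {
        case SDS_TYPE_5:
            return 0;   /* no header field to consult: always realloc */
        case SDS_TYPE_8: {
            const struct sdshdr8 *sh =
                (const void *)(s - sizeof(struct sdshdr8));
            return sh->free;
        }
        /* ...the 16/32/64-bit cases look the same with their own structs... */
        }
        return 0;
    }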

@antirez (Contributor) commented Jul 20, 2015

Hello @oranagra, thanks for spotting this bug. I was sure I had fixed it, but somehow I lost the commit. However, I added tests that can spot the bug if it happens again (by compiling with -DSDS_TEST_MAIN).

About the options we have for type 5: indeed it's kind of the same to use option 2 or 3. I vote for 2, mostly because it leaves SDS mostly allocator-independent, as it is today. In most normal code there will not be much of a speed penalty; however, there is a common case I would like to optimize, which is sdscatfmt() when appending single characters that don't contain any format specifier. In that case sdsMakeRoomFor() will be called for each character, which is kind of unfortunate.

However, I can't easily spot the difference while benchmarking.

Instead, I can see there is a speed regression in "CLIENT LIST" (from 20k to 15k calls with a single client), which is not due to TYPE 5 (I commented it out to check). So I'm investigating where the problem is, even if it may be non-trivial to understand.

@antirez (Contributor) commented Jul 22, 2015

This small benchmark can show the very big speed regression in the new sds.c implementation:

#include "sds.h"

int main(void) {
    int j, i;
    for (j = 0; j < 1000000; j++) {
        sds s = sdsempty();
        for (i = 0; i < 200; i++)
            s = sdscatlen(s, "abc", 3);
        sdsfree(s);
    }
}

Baseline is Redis unstable that took 3 seconds to run:

$ git co unstable
$ gcc sdsbench.c sds.c zmalloc.c -O2 -o /tmp/sds; time /tmp/sds
/tmp/sds  3.03s user 0.01s system 99% cpu 3.070 total

This PR commit takes 4.6 seconds:

$ git co f15df8b
$ gcc sdsbench.c sds.c zmalloc.c -O2 -o /tmp/sds; time /tmp/sds
/tmp/sds  4.57s user 0.01s system 99% cpu 4.605 total

Finally, I made things worse by introducing type 5:

$ git co sds
$ gcc sdsbench.c sds.c zmalloc.c -O2 -o /tmp/sds; time /tmp/sds
/tmp/sds  4.99s user 0.02s system 99% cpu 5.034 total

In most Redis commands concatenation is not such a stressed operation, but in general I think we should try to improve the performance before merging this, given how big the difference is. A slowdown of a few percentage points is OK in my opinion. There is also to consider that we don't get just the memory optimization here: we also get the ability of SDS strings to go over 4GB, which is important for the future. However, I think we can pay a smaller price for it, since this workload should be all allocator time + memcpy, and it is strange that we are getting problems this big here. Maybe it is unaligned access, or something that changed in the cache access pattern, or alike.

@antirez (Contributor) commented Jul 22, 2015

Note: client output buffers and query buffers use SDS concatenation, so I guess this is why we see performance issues in otherwise unrelated tests, like LRANGE_500 or just SET with pipelining.

@oranagra (Member, Author) commented:

I've reviewed the code again (I wrote it quite some time ago, long before the PR).

I have 3 suspects:

  1. Code cache: with all that inlining the code became a lot larger, and may not be in the cache in some cases. However, in your sdscatlen benchmark, on modern machines, I'm sure it is. So let's put that aside.

  2. This comment:
    /* since the header size changes, need to move the string forward, and can't use realloc */
    In your sdscatlen benchmark, my patch issues quite a few more real reallocs (ones that are not no-ops), which causes a lot of memcpying.

  3. Branch prediction and pipeline flush penalty. I'm not sure yet, but maybe we'll have to add manual branch prediction hints (which are not very portable, so they'll uglify the code); see the sketch below.
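
For reference, the kind of compiler-specific hints suspect 3 refers to (GCC/Clang __builtin_expect; a generic sketch, not code from the patch):

    #define likely(x)   __builtin_expect(!!(x), 1)
    #define unlikely(x) __builtin_expect(!!(x), 0)

    /* e.g. in sdsMakeRoomFor(), the common case is that the header type
     * does not change:
     *
     *     if (likely(oldtype == type)) {
     *         newsh = zrealloc(sh, hdrlen+newlen+1);
     *     } else {
     *         ... allocate new header + memcpy + free the old one ...
     *     }
     */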

Regarding the memory alignment, I'm sure it's not the issue, at least not with that benchmark.

I can test theory #2 with a hack that always reserves space for bigger headers. That will also fix the alignment, so if it works we'll have to run another test to tell the two apart.

@oranagra (Member, Author) commented:

Update: it doesn't seem to be any of the 3 reasons I listed above.
I reached that conclusion by experimenting: modifying the code and commenting out parts of it, then running your benchmark.
I ended up with a version that is written in the new form (macros and switch-cases), but supports only the 32-bit class and has all the switch cases commented out.
Although the performance did improve, it's still not close enough to the original performance IMO.
So the next step is to look at this simplified version's assembly and compare it to the original version, so that I can understand what the compiler doesn't like here...

Too tired for that now, going to sleep... just wanted to update you, since you're currently investigating the same thing.

@antirez (Contributor) commented Jul 22, 2015

Thank you @oranagra, this provides more information to understand what's going on. Tomorrow morning I'll continue doing a few more tests in light of what you wrote, in order to check if I can find something useful, and I'll report it here if I find anything.

@antirez (Contributor) commented Jul 23, 2015

A few findings:

  1. sdsMakeRoomFor() is no longer inlined in the new code since it's too big. Apparently this is part of the story.
  2. There are optimizations that win some time: from 5 seconds to 4 seconds of the new code in my test. The diff follows:
diff --git a/src/sds.c b/src/sds.c
index 8283ec4..4622694 100644
--- a/src/sds.c
+++ b/src/sds.c
@@ -80,6 +80,9 @@ sds sdsnewlen(const void *init, size_t initlen) {
     void *sh;
     sds s;
     char type = sdsReqType(initlen);
+#if 1
+    if (type == SDS_TYPE_5 && initlen == 0) type = SDS_TYPE_8;
+#endif
     int hdrlen = sdsHdrSize(type);
     unsigned char *fp; /* flags pointer. */
@@ -203,6 +210,9 @@ sds sdsMakeRoomFor(sds s, size_t addlen) {
         newlen += SDS_MAX_PREALLOC;

     type = sdsReqType(newlen);
+#if 1
+    if (type == SDS_TYPE_5) type = SDS_TYPE_8;
+#endif
     hdrlen = sdsHdrSize(type);
     if (oldtype==type) {
         newsh = zrealloc(sh, hdrlen+newlen+1);
@@ -363,8 +373,15 @@ sds sdsgrowzero(sds s, size_t len) {
 sds sdscatlen(sds s, const void *t, size_t len) {
     size_t curlen = sdslen(s);

+#if 1
+    if (sdsavail(s) < len) {
+        s = sdsMakeRoomFor(s,len);
+        if (s == NULL) return NULL;
+    }
+#else
     s = sdsMakeRoomFor(s,len);
     if (s == NULL) return NULL;
+#endif
     memcpy(s+curlen, t, len);
     sdssetlen(s, curlen+len);
     s[curlen+len] = '\0';

The ideas in the above diff:

  1. Don't use TYPE 5 if strings are going to be reallocated, since it sucks not having a free space left field.
  2. Don't call sdsMakeRoomFor() when obviously not needed.

@antirez (Contributor) commented Jul 24, 2015

This PR was just merged, thanks @oranagra. We verified that the speed hit is not very significant in most commands, with the exception of commands whose workload depends a lot on sdscatlen() or sdscatfmt(), where it is possible to lose up to 25% of performance. However:

  1. We only identified CLIENT LIST so far as having such a big loss.
  2. The other commands in general show a very small performance hit that could sometimes be compensated by the fact that the memory footprint of the whole Redis instance is smaller.
  3. The commit actually adds an important functionality to SDS: the ability not to crash when the user appends over 4GB, which was a concern in the past and is going to be even more of a concern in the future as computers get more memory.
  4. The commit is trivial to revert if we change our mind later for some reason, since it is very local to sds.c and does not affect the exported API.

@antirez closed this Jul 24, 2015