sds size classes - memory optimization #2509
Conversation
@@ -119,7 +119,7 @@ REDIS_SERVER_NAME=redis-server
 REDIS_SENTINEL_NAME=redis-sentinel
 REDIS_SERVER_OBJ=adlist.o quicklist.o ae.o anet.o dict.o redis.o sds.o zmalloc.o lzf_c.o lzf_d.o pqsort.o zipmap.o sha1.o ziplist.o release.o networking.o util.o object.o db.o replication.o rdb.o t_string.o t_list.o t_set.o t_zset.o t_hash.o config.o aof.o pubsub.o multi.o debug.o sort.o intset.o syncio.o cluster.o crc16.o endianconv.o slowlog.o scripting.o bio.o rio.o rand.o memtest.o crc64.o bitops.o sentinel.o notify.o setproctitle.o blocked.o hyperloglog.o latency.o sparkline.o redis-check-rdb.o
 REDIS_CLI_NAME=redis-cli
-REDIS_CLI_OBJ=anet.o sds.o adlist.o redis-cli.o zmalloc.o release.o anet.o ae.o crc64.o
+REDIS_CLI_OBJ=anet.o adlist.o redis-cli.o zmalloc.o release.o anet.o ae.o crc64.o
redis-cli uses hiredis (which comes with its own implementation of sds), no need for sds.o in redis-cli.
I'll have to review this carefully for implications, but I'm already a fan of this pull request, because it not only provides memory benefits, it actually fixes the problem of being bound to a 4 GB length in SDS strings, which was a problem for Redis Cluster when huge keys have to be serialized for MIGRATE. However, why are we using a 3/5/9/17 setup here? We could go for using the first two bits to store the class and have 2/4/8/16 classes. Note that this can be achieved easily using bitfields in a structure, with unions for each class.
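For what it's worth, a minimal sketch of what the bitfield idea above could look like (struct and field names are purely illustrative, not the code in this PR): the low two bits of the first field tag the size class, the remaining bits hold the length, giving 2/4/8/16 byte headers instead of 3/5/9/17; in practice the classes would be grouped in a union or dispatched on the tag bits.

#include <stdint.h>
#include <stdio.h>

struct sdshdr2 {            /* 2 byte header */
    uint16_t klass : 2;     /* size class tag */
    uint16_t len   : 14;    /* used length, up to ~16 KB */
    char buf[];
};

struct sdshdr4 {            /* 4 byte header */
    uint32_t klass : 2;
    uint32_t len   : 30;    /* up to ~1 GB; the 8/16 byte variants follow the same pattern */
    char buf[];
};

int main(void) {
    printf("hdr2=%zu hdr4=%zu\n", sizeof(struct sdshdr2), sizeof(struct sdshdr4));
    return 0;
}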
Hi, I thought of another approach: to use the 2 lowest bits in the actual sds pointer (since malloc always returns 8-byte aligned addresses), but that would mean we need to change sds not to be a char *, and you'll need to call a macro every time you want to pass an sds to a function that takes a char * (e.g. printf). I figured you won't like it very much... 8-( One more thing: with this approach I have 6 spare bits in the header which I can use for a ref-counter. Regarding the actual implementation. Last note: we've been using this code for quite some time now (with Redis 2.8), and it has been tested quite a lot.
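Just to make the tradeoff concrete, a tiny sketch of the pointer-tagging idea (illustrative names, not code from this patch), showing why a macro would be needed before passing the string to printf() and friends:

#include <stdint.h>
#include <stdio.h>
#include <stdlib.h>
#include <string.h>

typedef void *tagged_sds;                 /* no longer a plain char * */

#define SDS_TAG_MASK   ((uintptr_t)3)
#define SDS_TAG(s)     ((int)((uintptr_t)(s) & SDS_TAG_MASK))
#define SDS_PTR(s)     ((char *)((uintptr_t)(s) & ~SDS_TAG_MASK))  /* needed for printf() & co. */

static tagged_sds tag_ptr(char *p, int type) {
    /* malloc returns at least 8-byte aligned memory, so the 2 low bits are free. */
    return (tagged_sds)((uintptr_t)p | (uintptr_t)type);
}

int main(void) {
    char *raw = malloc(16);
    strcpy(raw, "hello");
    tagged_sds s = tag_ptr(raw, 2);       /* pretend type 2 is one of the size classes */
    printf("type=%d value=%s\n", SDS_TAG(s), SDS_PTR(s));
    free(SDS_PTR(s));                     /* must untag before giving it back to free() */
    return 0;
}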
Yep, there is a complexity tradeoff... Still, these 6 spare bits... argh. Well, maybe there is a good way to use them. Currently you are using 1 byte for the encoding type I guess, and only 2 bits are used for the different encodings. So given that most SDS strings are never resized, and the smaller they are the less likely they get resized, we can use 3 bits instead. One additional encoding that you are not using could use the remaining 5 bits to directly state the length, without any other field, nor any free-space information. This way strings up to 31 bytes can use a single byte header, which sounds interesting. Apart from that, there is a potential problem with this patch in general: is Redis expecting in some place that the address where the sds buffer starts is aligned? The obvious candidate was bitops.c; at least there I put a dirty check for this condition apparently:
However I would run your modified code in an environment where unaligned accesses produce segfaults. Maybe valgrind is also able to check this?
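A sketch of the single byte header idea described above (the macro names here are made up, the final code may differ): 3 bits select the encoding and, for this tiny encoding, the remaining 5 bits store the length directly, so strings up to 31 bytes need only a 1 byte header.

#include <stdio.h>

#define SDS_TYPE_BITS   3
#define SDS_TYPE_MASK   ((1 << SDS_TYPE_BITS) - 1)      /* low 3 bits = encoding */
#define SDS_TINY_TYPE   0
#define SDS_TINY_LEN(flags)  ((unsigned char)(flags) >> SDS_TYPE_BITS)
#define SDS_TINY_FLAGS(len)  (unsigned char)(((len) << SDS_TYPE_BITS) | SDS_TINY_TYPE)

int main(void) {
    unsigned char flags = SDS_TINY_FLAGS(31);           /* largest length that fits in 5 bits */
    printf("type=%d len=%d\n", flags & SDS_TYPE_MASK, SDS_TINY_LEN(flags));
    return 0;
}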
🌟
Nice idea about the fifth size class. Want me to implement it and add it to the pull request? I've tested it with valgrind (inside Redis 2.8) with the test suite. Good idea to test with segfault on unaligned access. I think Linux has a /proc option for that (at least on ARM it does).
ACK on everything, except it's better to wait for me to properly review the PR before implementing more, since currently I'm blabbing here without having checked the actual data structures used at all ;-) Thanks again, news soon.
+1
Don't worry, this is in line for the 3.2 merge roadmap :-) Work will start in the next few days, after the GEO commands stuff.
Very happy to hear that! Super excited about the 3.2 roadmap btw, thanks so much for all your hard work, Salvatore!
Thanks @ptaoussanis. This time we are also benefiting quite a lot from other coders' work :-) We have @mattsta's Geo stuff, @oranagra's memory optimizations (this and another one), @tmm1's AOF safety PR #2574, and more.
I completed a first review, that is, I just read the code line by line.
Now, as a next step, I've merged the code locally and I'm going to do some tests and a bit more review, changing a few function names and adding a few macros which I think can improve code readability. I'll only work on this PR until it's merged, or until I find a major issue to report here. So far so good, and thanks!
I vote in favor of the tiny size class (SDS_TYPE_6). It will save one more byte in many, many cases. I looked at the code in bitops.c and it looks safe.
This is an attempt to use the refcount feature of the sds.c fork provided in the Pull Request #2509. A new type, SDS_TYPE_5, is introduced, having a one byte header with just the string length, without information about the available additional space at the end of the string (this means that sdsMakeRoomFor() will be required each time we want to append something, since the string will always report having 0 bytes available). More work is needed in order to avoid having common SDS functions pay the cost of this type. For example, both sdscatprintf() and sdscatfmt() should try to upgrade to SDS_TYPE_8 ASAP when appending chars.
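To make the tradeoff explicit, here is a small self-contained sketch (assumed constants and layout, not the exact code from this commit) of why appends to the new type always go through sdsMakeRoomFor(): with no free/alloc field in the 1 byte header, available space has to be reported as zero.

#include <stdio.h>
#include <stddef.h>

#define SDS_TYPE_MASK 7
#define SDS_TYPE_5    0   /* 5-bit length packed in the flags byte, no 'free' field */
#define SDS_TYPE_8    1   /* 1-byte len + 1-byte free + flags byte                  */

struct sdshdr8_sketch { unsigned char len, free, flags; char buf[]; };

static size_t sdsavail_sketch(const char *s) {
    unsigned char flags = (unsigned char)s[-1];          /* flags byte sits right before the string */
    if ((flags & SDS_TYPE_MASK) == SDS_TYPE_5)
        return 0;                                        /* forces sdsMakeRoomFor() on every append */
    const struct sdshdr8_sketch *sh =
        (const struct sdshdr8_sketch *)(s - sizeof(struct sdshdr8_sketch));
    return sh->free;                                     /* bigger headers handled similarly */
}

int main(void) {
    /* Fake headers built by hand just to exercise the sketch. */
    unsigned char tiny[8] = { (3 << 3) | SDS_TYPE_5, 'a', 'b', 'c', 0 };
    unsigned char fat[8]  = { 3, 4, SDS_TYPE_8, 'a', 'b', 'c', 0 };   /* len=3, free=4 */
    printf("tiny avail=%zu fat avail=%zu\n",
           sdsavail_sketch((const char *)tiny + 1),
           sdsavail_sketch((const char *)fat + 3));
    return 0;
}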
Hello my now colleague @oranagra :-) I tried to implement TYPE 5, your comments are welcome. Now testing the difference it makes. Of course this whole PR benefits a lot from the other one where there is a new size class for jemalloc...
Forgot to mention that, btw, even if some complexity is added, there are use cases with ~10 to ~20% memory benefits, so consider this merged... just give me the time to see if this can be improved in some way.
I reviewed the code (see one comment inside the code). I didn't realize (until now) that we don't have space for the 'alloc' field.
In fact, when thinking about it, the realloc will be a no-op if the allocation doesn't jump from one allocator bin to another (it won't do allocate+memcpy+free); it will just return the same pointer.
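A quick way to check that claim on a given allocator (just an experiment sketch; whether realloc() stays in place is allocator dependent and not guaranteed):

#include <stdint.h>
#include <stdio.h>
#include <stdlib.h>

int main(void) {
    void *p = malloc(40);                    /* typically a 48-byte small bin under jemalloc */
    uintptr_t before = (uintptr_t)p;         /* remember the address before realloc */
    void *q = realloc(p, 44);                /* still fits the same bin */
    printf("same pointer: %s\n", (uintptr_t)q == before ? "yes" : "no");
    free(q);
    return 0;
}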
Hello @oranagra, thanks for spotting this bug; I was sure I had fixed it, but somehow I lost the commit. However, I added tests that can spot the bug if it happens again (by compiling with
About the options we have for type 5: indeed it's kind of the same to use option 2 or 3. I vote for 2, mostly because it leaves SDS mostly allocator independent, as it is today. In most normal code there will not be much of a speed penalty; however, there is a common case I would like to optimize, which is sdscatfmt() when appending single characters that don't contain any format specifier: in that case sdsMakeRoomFor() will be called for each character, which is kind of unfortunate. However, I can't easily spot the difference while benchmarking. Instead I can see there is a speed regression in "CLIENT LIST" (from 20k to 15k calls with a single client), which is not due to TYPE 5 (I commented it out to check). So I'm investigating where the problem is, even if it may be non-trivial to understand.
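One possible mitigation for the per-character sdsMakeRoomFor() cost in sdscatfmt(), sketched here with the sds helpers that appear elsewhere in this thread (a sketch only, with a made-up function name, not necessarily what will be merged): grow in chunks and only hit the allocator when the space is really exhausted.

#include "sds.h"   /* assumes the sds API used in this PR: sdslen, sdsavail, sdsMakeRoomFor, sdssetlen */

/* Append the plain (non-'%') characters of fmt to s, reallocating in chunks
 * rather than once per character. */
sds sdscat_plain_chunked(sds s, const char *fmt) {
    size_t i = sdslen(s);
    while (*fmt) {
        if (sdsavail(s) == 0) {            /* only touch the allocator when out of space */
            s = sdsMakeRoomFor(s, 32);     /* reserve a chunk, not a single byte */
            if (s == NULL) return NULL;
        }
        s[i++] = *fmt++;                   /* ordinary character: copy directly */
        sdssetlen(s, i);
    }
    s[i] = '\0';                           /* sds always keeps room for the terminator */
    return s;
}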
This small benchmark can show the very big speed regression in the new sds.c implementation:

#include "sds.h"

int main(void) {
    int j, i;
    for (j = 0; j < 1000000; j++) {
        sds s = sdsempty();
        for (i = 0; i < 200; i++)
            s = sdscatlen(s, "abc", 3);
        sdsfree(s);
    }
}

Baseline is Redis unstable, which took 3 seconds to run:
This PR's commit takes 4.6 seconds:
Finally, I made things worse by introducing type 5:
In most Redis commands concatenation is not such a stressed code path, but in general I think we should try to improve the performance before merging this, given how big the difference is. A slowdown of a few
Note: client output buffers and query buffers use SDS concatenation, so I guess this is why we see performance issues in otherwise unrelated tests, like LRANGE_500 or just SET with pipelining.
I've reviewed the code again (I wrote it quite some time ago, long before the PR). I have 3 suspects:
Regarding the memory alignment, I'm sure it's not the cause, at least not with that benchmark. I can test theory #2 with a hack that always reserves enough space for the bigger headers. It will also fix the alignment, so if it works we'll have to run another test to tell the two apart.
Update: it doesn't seem to be any of the 3 reasons I listed above. Too tired for that now, going to sleep... just wanted to update you since you're currently investigating the same thing.
Thank you @oranagra, this gives me more context to understand what's going on. Tomorrow morning I'll continue doing a few more tests in light of what you wrote, to check if I can find something useful, and I'll report back if I find anything.
A few findings:
diff --git a/src/sds.c b/src/sds.c
index 8283ec4..4622694 100644
--- a/src/sds.c
+++ b/src/sds.c
@@ -80,6 +80,9 @@ sds sdsnewlen(const void *init, size_t initlen) {
void *sh;
sds s;
char type = sdsReqType(initlen);
+#if 1
+ if (type == SDS_TYPE_5 && initlen == 0) type = SDS_TYPE_8;
+#endif
int hdrlen = sdsHdrSize(type);
unsigned char *fp; /* flags pointer. */
@@ -203,6 +210,9 @@ sds sdsMakeRoomFor(sds s, size_t addlen) {
newlen += SDS_MAX_PREALLOC;
type = sdsReqType(newlen);
+#if 1
+ if (type == SDS_TYPE_5) type = SDS_TYPE_8;
+#endif
hdrlen = sdsHdrSize(type);
if (oldtype==type) {
newsh = zrealloc(sh, hdrlen+newlen+1);
@@ -363,8 +373,15 @@ sds sdsgrowzero(sds s, size_t len) {
sds sdscatlen(sds s, const void *t, size_t len) {
size_t curlen = sdslen(s);
+#if 1
+ if (sdsavail(s) < len) {
+ s = sdsMakeRoomFor(s,len);
+ if (s == NULL) return NULL;
+ }
+#else
s = sdsMakeRoomFor(s,len);
if (s == NULL) return NULL;
+#endif
memcpy(s+curlen, t, len);
sdssetlen(s, curlen+len);
 s[curlen+len] = '\0';

The ideas in the above diff:

1. Never create an SDS_TYPE_5 string when the initial length is zero (an empty string is very likely to be appended to); use SDS_TYPE_8 instead.
2. When sdsMakeRoomFor() picks a new type for a grown string, never pick SDS_TYPE_5; upgrade to SDS_TYPE_8.
3. In sdscatlen(), call sdsMakeRoomFor() only when there is not already enough space available.
This PR was just merged, thanks @oranagra. We verified that the speed hit is not very significant in most commands, with the exception of commands whose workload depends a lot on sdscatlen() or sdscatfmt(), where it is possible to lose up to 25% of performance. However:
sds strings were using an 8 byte header, which has two implications:
1. A significant memory overhead, especially for short strings.
2. The string length is bound to 4 GB (a problem, for instance, when Redis Cluster has to serialize huge keys for MIGRATE, as mentioned above).
With this patch, each sds instance can be one of 4 different internal types, with 3/5/9/17 byte headers holding 8, 16, 32 and 64 bit length fields respectively.
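For illustration, a sketch of what these four headers could look like (struct names and the exact field set are assumptions here, not necessarily the patch's code); packing keeps the headers at exactly 3/5/9/17 bytes, with the flags byte sitting right before the string data:

#include <stdint.h>
#include <stdio.h>

/* __packed__ is a GCC/clang extension; it keeps each header at its minimal size. */
struct __attribute__((__packed__)) sdshdr8  { uint8_t  len; uint8_t  free; unsigned char flags; char buf[]; };
struct __attribute__((__packed__)) sdshdr16 { uint16_t len; uint16_t free; unsigned char flags; char buf[]; };
struct __attribute__((__packed__)) sdshdr32 { uint32_t len; uint32_t free; unsigned char flags; char buf[]; };
struct __attribute__((__packed__)) sdshdr64 { uint64_t len; uint64_t free; unsigned char flags; char buf[]; };

int main(void) {
    printf("header sizes: %zu %zu %zu %zu\n",
           sizeof(struct sdshdr8), sizeof(struct sdshdr16),
           sizeof(struct sdshdr32), sizeof(struct sdshdr64));   /* 3 5 9 17 */
    return 0;
}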
Test (on relatively short strings)
debug populate 10000000
used_memory of original code: 1254709872
used_memory with new code: 1078723024
memory optimization: 16%
Notice that, considering internal fragmentation, there are some string lengths which will now fit into a smaller jemalloc bin; for these the result will be closer to a 2x factor of memory reduction.
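For example (illustrative numbers, assuming jemalloc's usual 16/32/48/64... small size classes): an 8 byte string needs 8 (header) + 8 (data) + 1 (terminator) = 17 bytes with the old header and lands in the 32 byte bin, while with a 3 byte header it needs 12 bytes and fits in the 16 byte bin, a 2x reduction for that allocation.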