Optimize CPU cache efficiency on dict while it's being rehashed #5692
Conversation
I also have the same doubts. Do you have any new thoughts? Can you explain it to me? Thanks.
We don't need to check ht[0] if the index is less than rehashidx, because all elements in buckets below rehashidx have already been moved to ht[1].
Perfect! I agree with you. What does the Redis author think? Why hasn't he merged this request?
I don't know. Maybe it's not urgent.
@erpeng Thanks. This optimization is nice, but I wonder how much there is to gain here? Immediately after computing the index we look up the table anyway. I suggest closing this, but let me know if you think otherwise.
I think it's not only about performance. With this change, it's simpler and more understandable that we're accounting for the rehash.
@erpeng There is a case where the rehash condition is reached inside dictAddRaw; at that point rehashidx is set to 0, and the key being added goes into table[1]. If someone then looks up that key, since rehashidx is 0 and smaller than its idx_0, the lookup goes to table[0], where the key won't be found.
@ohye3166 It would be better to express your question in English.
Yes, you are right, my mistake.
Co-authored-by: sundb <sundbcn@gmail.com>
I have refreshed this PR. Can you explain the cache-miss situation? I can't quite understand it.
Results of testing with unstable:
This PR:
How do I run this test?
@erpeng
@sundb I ran the benchmark myself and got different results; here are my numbers (running with 50 million requests instead of 5 million, for accuracy), on an ARM instance (maybe it's ARM?).
Current code:
With this PR:
My concern with running 5 million requests is that too much of the data might fit in the CPU cache, so it might be unrepresentative of real workloads.
@madolson I ran the dict unit test again on macOS. Unstable:
With this PR:
@erpeng Following a pointer to look up its value, or accessing an array at an offset, is usually an access to memory that wasn't recently accessed and is not inside the CPU cache, which results in a real memory access (much slower than running a few instructions on stack-based variables that are likely to be cached). P.S. I don't mind merging this even if we can't measure the difference (since the change just adds one line and doesn't refactor anything), but what's odd is that some results show degradation, which doesn't make sense.
@sundb I got similar values again on macOS, but I generally don't trust Macs, since they can be weird with libMalloc and, being laptops, have cores with a variable clock speed. As far as I can tell, there should be no performance change when doing the random key removal, since nothing around that code path has changed. It could be related to this: https://www.youtube.com/watch?v=r-TLSBdHe1A&t=464s. If we are somehow messing with the branch predictor (which is probably important here), we might be getting a lot of CPU stalls. We probably need better tooling for determining whether something actually improves performance.
For anyone running the dict benchmarks: in the context of this PR, I think we should be focusing on performance during rehashing. The benchmarks currently wait for rehashing to complete, but this optimization should only be significant while it's running. It might be interesting to add a rehashing benchmark that somehow runs the same tests while rehashing is in progress.
I'll go ahead and close #9898 since we concluded this one is better, but I want to state again that the fact we weren't able to measure any performance gain was possibly because our benchmark didn't really hit the scenario we optimized. I think our long-term goal of completely replacing the dict shouldn't conflict with small incremental improvements.
@oranagra thank you for your hint, I'm happy to try it |
Now I've tested this PR. The results are similar to @madolson's and @sundb's: no improvement or deterioration in performance could be observed. But through the perf tool, I observed that this PR makes fewer DRAM accesses. My test environment was a bare-metal machine with 96 cores and 512GB of DRAM. I devised two different benchmarks, and the results are consistent.
int bitsize = 0;
while ((count = count / 2) != 0) {
bitsize++;
}
dictEmpty(dict, NULL);
count = 1L << bitsize;
//Inserting 2^N+1 keys triggers the rehash, ensuring that table[1] exists
for (j = 0; j < count + 1; j++) {
int ret = dictAdd(dict, stringFromLongLong(j), (void *)j);
assert(ret == DICT_OK);
}
char key[21];
start_benchmark();
for (j = 0; j < count; j++) {
//Using ll2string reduces the memory operations required to observe rehash performance changes more accurately
ll2string(key, 21, j);
key[0] = 'X';
dictEntry *de = dictFind(dict, key);
assert(de == NULL);
}
end_benchmark("Accessing missing when rehashing");

I also compared the assembly and found that gcc 7.5's -O2 optimization does not adversely affect the function with the new branch. I then looked at the cache misses and references using perf, and it was clear that this PR does fewer fetches from DRAM and takes fewer misses.

$ sudo perf stat -e cache-misses -e cache-references ./redis-server-unstable test dict --accurate
Performance counter stats for './redis-server-unstable test dict --accurate':
196,515,412 cache-misses # 49.121 % of all cache refs
400,063,631 cache-references
$ sudo perf stat -e cache-misses -e cache-references ./redis-server-rehashidx test dict --accurate
Performance counter stats for './redis-server-rehashidx test dict --accurate':
187,646,220 cache-misses # 47.933 % of all cache refs
391,479,924 cache-references

My guess is that this is due to instruction-level parallelism and DRAM prefetching in modern CPUs, where table[0] is always fetched concurrently with table[1], so reducing table[0] accesses won't improve performance.
thanks @judeng, i agree we should still merge it.
@oranagra I understand your concerns and have updated the code; I also commented out the other test cases in dict.c. The results are exactly the same as before. Please feel free to add your thoughts.
int bitsize = 0;
while ((count = count / 2) != 0) {
bitsize++;
}
dictEmpty(dict, NULL);
count = 1L << bitsize;
for (j = 0; j < count; j++) {
int ret = dictAdd(dict, stringFromLongLong(j), (void *)j);
assert(ret == DICT_OK);
}
dictExpand(dict, count*2);
// Make rehashidx large enough so that most keys have idx smaller than rehashidx
while(1) {
dictRehash(dict, 100);
if(dict->rehashidx > (count-1000)) break;
}
printf("dict->rehashidx:%ld\n", dict->rehashidx);
dictPauseRehashing(dict);
char key[21];
start_benchmark();
for (j = 0; j < count; j++) {
ll2string(key, 21, j);
key[0] = 'X';
dictEntry *de = dictFind(dict, key);
assert(de == NULL);
}
end_benchmark("Accessing missing when rehashing");
[judeng@fd-test-codis-dash02 src]$ sudo perf stat -e cache-misses -e cache-references ./redis-server test dict --accurate
dict->rehashidx:4193368
Accessing missing when rehashing: 4194304 items in 990 ms
Performance counter stats for './redis-server test dict --accurate':
37,598,655 cache-misses # 42.897 % of all cache refs
87,647,805 cache-references
5.781412267 seconds time elapsed
[judeng@fd-test-codis-dash02 src]$ sudo perf stat -e cache-misses -e cache-references ./redis-server-rehash test dict --accurate
dict->rehashidx:4193368
Accessing missing when rehashing: 4194304 items in 912 ms
Performance counter stats for './redis-server-rehash test dict --accurate':
36,063,218 cache-misses # 44.266 % of all cache refs
81,470,018 cache-references
5.719584564 seconds time elapsed
Aren't we missing a
@oranagra Maybe I missed something. Is the following code what you want? It avoids the overhead of inserting the keys.

dictEmpty(dict, NULL);
dictExpand(dict, count);
while(dictIsRehashing(dict)) {
dictRehash(dict, 100);
}
dictExpand(dict, count*2);
// Make rehashidx large enough so that most keys have idx smaller than rehashidx
while(1) {
dictRehash(dict, 100);
if(dict->rehashidx > DICTHT_SIZE(dict->ht_size_exp[0])-2000) break;
}
printf("dict->rehashidx:%ld\n", dict->rehashidx);
dictPauseRehashing(dict);
char key[21];
start_benchmark();
for (j = 0; j < count; j++) {
ll2string(key, 21, j);
key[0] = 'X';
dictEntry *de = dictFind(dict, key);
assert(de == NULL);
}
end_benchmark("Accessing missing when rehashing");
In your above example the dict is empty. |
@oranagra Yes, you are right, we do need to fill in some keys. I wanted to minimize unnecessary memory operations in this test so I could see the impact of this PR on the CPU cache, so I only filled keys into 1/100 of the dict's buckets. The results are even clearer, with DRAM accesses nearly halved:

dictEmpty(dict, NULL);
dictExpand(dict, count);
while(dictIsRehashing(dict)) {
dictRehash(dict, 100);
}
for (j = 0; j < count / 100; j++) {
int ret = dictAdd(dict, stringFromLongLong(j), (void *)j);
assert(ret == DICT_OK);
}
dictExpand(dict, count*2);
printf("dict->rehashidx:%ld, dict size:%lu\n", dict->rehashidx, DICTHT_SIZE(dict->ht_size_exp[0]));
// Make rehashidx large enough so that most keys have idx smaller than rehashidx
while(1) {
dictRehash(dict, 1);
if (dict->ht_used[0] < 10)
break;
}
printf("dict->rehashidx:%ld\n", dict->rehashidx);
dictPauseRehashing(dict);
char key[21];
start_benchmark();
for (j = 0; j < count; j++) {
ll2string(key, 21, j);
key[0] = 'X';
dictEntry *de = dictFind(dict, key);
assert(de == NULL);
}
end_benchmark("Accessing missing when rehashing");

Results:

[judeng@fd-test-codis-dash02 src]$ sudo perf stat -e cache-misses -e cache-references ./redis-server-unstable test dict 50000000
dict->rehashidx:0, dict size:67108864
dict->rehashidx:67107237
Accessing missing when rehashing: 50000000 items in 9167 ms
Performance counter stats for './redis-server-unstable test dict 50000000':
126,969,332 cache-misses # 56.962 % of all cache refs
222,900,692 cache-references
11.311874814 seconds time elapsed
[judeng@fd-test-codis-dash02 src]$ sudo perf stat -e cache-misses -e cache-references ./redis-server-rehash test dict 50000000
dict->rehashidx:0, dict size:67108864
dict->rehashidx:67107237
Accessing missing when rehashing: 50000000 items in 8711 ms
Performance counter stats for './redis-server-rehash test dict 50000000':
80,324,778 cache-misses # 62.366 % of all cache refs
128,795,881 cache-references
10.919625194 seconds time elapsed
great.. that's conclusive proof that it's an improvement! |
When finding a key while Redis is rehashing, we currently have to look up both tables (ht[0] and ht[1]).
If we compare the key's ht[0] index to rehashidx, then when index < rehashidx we can conclude that the bucket has already been migrated to ht[1], so ht[0] can be skipped.
The possible performance gain here is not the looping over the linked list (which is empty),
but rather the lookup in the table (which could be a cache miss).
benchmark: #5692 (comment)