Introduce thread-local storage variable to reduce atomic contention when updating used_memory metrics. #308
Conversation
What are the server configurations you are using for the test? I didn't see them listed. Can you also just do a simple test with get/set?
Just updated the server config and SET/GET results in the top comment.
@valkey-io/core-team, could you help take a look at this patch?
I did not take a deep look; how do we handle module threads?
Hi @enjoy-binbin, thanks for your comments on this patch. BTW, your blog on Yuque helped me a lot in understanding Redis/Valkey.
Sorry, I missed the module threads. Maybe I can remove the explicit call of the
Codecov Report: All modified and coverable lines are covered by tests ✅

Additional details and impacted files:

```
@@             Coverage Diff              @@
##           unstable     #308      +/-   ##
============================================
- Coverage     70.22%   70.19%    -0.03%
============================================
  Files           110      110
  Lines         60039    60066       +27
============================================
+ Hits          42163    42165        +2
- Misses        17876    17901       +25
```
@enjoy-binbin @PingXie @madolson Thoughts on this patch?
Kindly ping @valkey-io/core-team. |
This one seems almost ready to merge.
@PingXie @enjoy-binbin @madolson, are there any open questions, and any suggestions on how we can close them?
At the Valkey Contributor Summit, someone (who?) suggested that we remove this counter in zmalloc and instead rely on the stats from the allocator. The INFO fields are
I think there are still 2 comments to be addressed.
@lipzhu you mentioned that
@zuiderkwast I am not clear on the full history of
But I noticed that the metrics of
@PingXie Yes, I am also a little confused here. According to the suggestion in #308 (comment), how do we handle the modules? Is there any way to get the number of modules as a factor of the max thread number?
Can you be more specific?
I am OK with any proposal if the core team makes the final decision.
Each call of
Thanks for putting some focus on this performance bottleneck! I commented in issue #467. I think we can use jemalloc's stats and fall back to counting in zmalloc only if another allocator is used.
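For reference, reading the allocator-side number from jemalloc could look roughly like this (a sketch, not the actual patch; the wrapper name and fallback wiring are assumptions — note that `"stats.allocated"` is only refreshed after bumping the `"epoch"` control):

```c
#include <stddef.h>
#include <stdint.h>
#include <jemalloc/jemalloc.h>

/* Sketch: ask jemalloc for total allocated bytes instead of counting
 * every allocation in zmalloc. */
static size_t allocator_used_memory(void) {
    uint64_t epoch = 1;
    size_t sz = sizeof(epoch);
    /* jemalloc caches its stats; writing "epoch" refreshes them. */
    mallctl("epoch", &epoch, &sz, &epoch, sz);

    size_t allocated = 0;
    sz = sizeof(allocated);
    mallctl("stats.allocated", &allocated, &sz, NULL, 0);
    return allocated;
}
```

With another allocator, zmalloc would keep counting as it does today.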
Sorry, I meant to ask if you had a chance to evaluate the overhead of
I think we should get a perf reading of this change first.
+1. Appreciate it, @lipzhu!
@zuiderkwast @PingXie Updated the patch as suggested in #467.
Introduce thread-local storage variable to reduce atomic contention when updating used_memory metrics. Signed-off-by: Lipeng Zhu <lipeng.zhu@intel.com>
LGTM (with some nits)
It's almost the code I posted in a comment. :) It was more fun to write code than to explain in text. Sorry for that.
It seems we have a problem on macOS:

See https://stackoverflow.com/questions/16244153/clang-c11-threads-h-not-found. Maybe we need some conditional compilation like:

```c
#if __STDC_NO_THREADS__
#define thread_local __thread
#else
#include <threads.h>
#endif
```
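With that in place, a per-thread counter can be declared portably, for example (variable name illustrative):

```c
/* One instance per thread; updated with plain, non-atomic adds. */
static thread_local long used_memory_delta = 0;
```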
Signed-off-by: Lipeng Zhu <lipeng.zhu@intel.com>
Signed-off-by: Lipeng Zhu <lipeng.zhu@intel.com>
LGTM.
There's a "major-decision-pending" label on the issue, so let's wait for a majority of the @valkey-io/core-team (vote or approve) before we merge this.
I have some concerns:
@soloestoy Isn't checking time in the allocator function too slow? It is a very hot code path. Some alternative ideas:
Yes, it would be best if every thread could register a timer that triggers flushing the delta every second.
Maybe 10KB is an acceptable error.
Native threads can work well, but module threads are a problem.
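For illustration, the bounded-error variant could look roughly like this (a sketch, not the final patch; the names and the ~10KB threshold are taken from this discussion):

```c
#include <stdatomic.h>

/* Shared counter behind zmalloc_used_memory() / INFO. */
static _Atomic long used_memory = 0;

/* Per-thread pending delta, updated with plain (non-atomic) adds. */
static _Thread_local long used_memory_delta = 0;

/* Each thread's error stays bounded by this threshold. */
#define USED_MEMORY_FLUSH_THRESHOLD (10 * 1024)

static inline void update_used_memory(long size) {
    used_memory_delta += size;
    if (used_memory_delta >= USED_MEMORY_FLUSH_THRESHOLD ||
        used_memory_delta <= -USED_MEMORY_FLUSH_THRESHOLD) {
        atomic_fetch_add_explicit(&used_memory, used_memory_delta,
                                  memory_order_relaxed);
        used_memory_delta = 0;
    }
}
```

The worst-case drift of used_memory is then roughly threshold × thread count, which is what makes a ~10KB per-thread error acceptable.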
@soloestoy How do we do this per thread in a simple way? A signal handler?
Let's do this then? I think it can provide much of the benefit already.
Simple solution: we can do an explicit flush to the atomic counter in ValkeyModule_Alloc and ValkeyModule_Free.
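A sketch of that idea (`zmalloc_flush_used_memory_delta` is a hypothetical helper that folds the thread-local delta into the shared atomic):

```c
#include <stddef.h>

/* Valkey's allocation wrappers and the hypothetical flush helper. */
void *zmalloc(size_t size);
void zfree(void *ptr);
void zmalloc_flush_used_memory_delta(void); /* hypothetical helper */

/* Module threads may never reach a regular flush point, so the
 * module-facing wrappers flush eagerly to keep used_memory accurate. */
void *ValkeyModule_Alloc(size_t bytes) {
    void *ptr = zmalloc(bytes);
    zmalloc_flush_used_memory_delta();
    return ptr;
}

void ValkeyModule_Free(void *ptr) {
    zfree(ptr);
    zmalloc_flush_used_memory_delta();
}
```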
@soloestoy The PR is based on the assumption that we can accept an approximate used_memory. If we are not aligned on the absolute
Description

This patch introduces a thread-local storage variable to replace the atomic counter in `zmalloc`, to reduce unnecessary contention.

Problem Statement

The `zmalloc`- and `zfree`-related functions update `used_memory` on every operation, and they are called very frequently. In the benchmark memtier_benchmark-1Mkeys-load-stream-5-fields-with-100B-values-pipeline-10.yml, the cycle ratios of `zmalloc` and `zfree` are high, even though they are thin wrappers around the underlying allocator library and should not take many cycles. Most of those cycles are contributed by `lock add` and `lock sub`, which are expensive instructions. Profiling shows the metric updates mainly come from the main thread, so a thread-local variable removes most of the contention, as sketched below.
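In essence, the change on the hot path looks like this (a simplified sketch; see the flush discussion above for when the delta is folded back into the shared counter):

```c
#include <stdatomic.h>

static _Atomic long used_memory;              /* shared, read by INFO */
static _Thread_local long used_memory_delta;  /* private to each thread */

/* Before this patch: every zmalloc/zfree did an atomic RMW, which
 * compiles to "lock add"/"lock sub" on x86 and serializes all threads. */
static inline void update_used_memory_atomic(long size) {
    atomic_fetch_add_explicit(&used_memory, size, memory_order_relaxed);
}

/* With this patch: a plain add on a thread-local delta; the delta is
 * only occasionally published to the shared counter. */
static inline void update_used_memory_tls(long size) {
    used_memory_delta += size;
}
```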
Performance Impact
Test Env
Start Server
```
taskset -c 0 ~/valkey/src/valkey-server /tmp/valkey_1.conf
```
Using the benchmark memtier_benchmark-1Mkeys-load-stream-5-fields-with-100B-values-pipeline-10.yml:

```
memtier_benchmark -s 127.0.0.1 -p 9001 "--pipeline" "10" "--data-size" "100" --command "XADD __key__ MAXLEN ~ 1 * field __data__" --command-key-pattern="P" --key-minimum=1 --key-maximum 1000000 --test-time 180 -c 50 -t 4 --hide-histogram
```

We can observe a QPS gain of more than 6%.
For the other SET/GET benchmarks, using a command like:

```
taskset -c 6-9 ~/valkey/src/valkey-benchmark -p 9001 -t set,get -d 100 -r 1000000 -n 1000000 -c 50 --threads 4
```
No perf gain, and no regression.

With pipelining enabled, I can observe a 4% perf gain with this test case:

```
taskset -c 4-7 memtier_benchmark -s 127.0.0.1 -p 9001 "--pipeline" "10" "--data-size" "100" --ratio 1:0 --key-pattern P:P --key-minimum=1 --key-maximum 1000000 --test-time 180 -c 50 -t 4 --hide-histogram
```