
Introduce Thread-local storage variable to reduce atomic contention when update used_memory metrics. #308

Closed · 3 commits

Conversation


@lipzhu lipzhu commented Apr 12, 2024

Description

This patch introduces a thread-local storage (TLS) variable to replace the atomic counter in zmalloc and reduce unnecessary contention.

Problem Statement

The zmalloc and zfree family of functions update used_memory on every operation, and they are called very frequently. In the benchmark memtier_benchmark-1Mkeys-load-stream-5-fields-with-100B-values-pipeline-10.yml, the cycle share of zmalloc and zfree is high; as thin wrappers around the underlying allocator library, they should not consume that many cycles. Most of those cycles come from lock add and lock sub, which are expensive instructions. Profiling shows the metric updates mainly come from the main thread, so using a TLS variable removes a lot of contention.

[profiling screenshots: cycle breakdown of zmalloc/zfree]

Performance Impact

Test Env
  • OS: CentOS Stream 8
  • Kernel: 6.2.0
  • Platform: Intel Xeon Platinum 8380
  • Server and Client in same socket
Start Server

taskset -c 0 ~/valkey/src/valkey-server /tmp/valkey_1.conf

Contents of /tmp/valkey_1.conf:

port 9001
bind * -::*
daemonize yes
protected-mode no
save ""

Using the benchmark memtier_benchmark-1Mkeys-load-stream-5-fields-with-100B-values-pipeline-10.yml:
memtier_benchmark -s 127.0.0.1 -p 9001 "--pipeline" "10" "--data-size" "100" --command "XADD __key__ MAXLEN ~ 1 * field __data__" --command-key-pattern="P" --key-minimum=1 --key-maximum 1000000 --test-time 180 -c 50 -t 4 --hide-histogram

We can observe more than 6% QPS gain.

For the other SET/GET benchmarks, using commands like:
taskset -c 6-9 ~/valkey/src/valkey-benchmark -p 9001 -t set,get -d 100 -r 1000000 -n 1000000 -c 50 --threads 4

No performance gain or regression was observed.

With pipelining enabled, I observed a ~4% performance gain with this test case:
taskset -c 4-7 memtier_benchmark -s 127.0.0.1 -p 9001 "--pipeline" "10" "--data-size" "100" --ratio 1:0 --key-pattern P:P --key-minimum=1 --key-maximum 1000000 --test-time 180 -c 50 -t 4 --hide-histogram

@lipzhu lipzhu changed the title Introduce TLS variable to replace atomic when update used_memory metrics to reduce contention. Introduce Thread-local storage variable to replace atomic when update used_memory metrics to reduce contention. Apr 12, 2024
@madolson (Member) left a comment

What are the server configurations you are using for the test? I didn't see them listed. Can you also just do a simple test with get/set?

@lipzhu (Contributor, Author) commented Apr 15, 2024

What are the server configurations you are using for the test? I didn't see them listed. Can you also just do a simple test with get/set?

Just updated the server config and the SET/GET results in the top comment.

@lipzhu (Contributor, Author) commented Apr 17, 2024

@valkey-io/core-team, could you help to take a look at this patch?

@enjoy-binbin (Member) left a comment

I did not take a deep look; how do we handle module threads?

@lipzhu (Contributor, Author) commented Apr 22, 2024

Hi @enjoy-binbin, thanks for your comments on this patch. BTW, your blog on yuque helped me a lot in understanding redis/valkey.

i did not take a deep look, how do we handle module threads?

Sorry, I missed the module threads. Maybe I can remove the explicit call to zmalloc_register_thread_index in the start routine, initialize static __thread int thread_index = -1, and then add a check if (thread_index == -1) zmalloc_register_thread_index(); in zmalloc/zfree?

codecov bot commented Apr 22, 2024

Codecov Report

All modified and coverable lines are covered by tests ✅

Project coverage is 70.19%. Comparing base (b546dd2) to head (aeb8159).
Report is 12 commits behind head on unstable.

Additional details and impacted files
@@             Coverage Diff              @@
##           unstable     #308      +/-   ##
============================================
- Coverage     70.22%   70.19%   -0.03%     
============================================
  Files           110      110              
  Lines         60039    60066      +27     
============================================
+ Hits          42163    42165       +2     
- Misses        17876    17901      +25     
Files Coverage Δ
src/evict.c 97.34% <100.00%> (ø)
src/zmalloc.c 85.44% <100.00%> (+0.80%) ⬆️

... and 14 files with indirect coverage changes

@lipzhu (Contributor, Author) commented Apr 28, 2024

@enjoy-binbin @PingXie @madolson Thoughts about this patch?

@lipzhu (Contributor, Author) commented May 31, 2024

Kindly ping @valkey-io/core-team.

@zuiderkwast (Contributor) left a comment

This one seems almost ready to merge.

@PingXie @enjoy-binbin @madolson are there any open questions and suggestions how we can close them?

@zuiderkwast zuiderkwast added the to-be-merged Almost ready to merge label Jun 13, 2024
@zuiderkwast (Contributor)

In the Valkey Contributor Summit, someone (who?) suggested that we remove this counter in zmalloc and instead rely on the stats from the allocator, like je_mallctl("stats.allocated"). I tend to agree with that. zmalloc is in a hot code path and it'd be good to avoid tracking this. The only difference seems to be that we can see how much was allocated using zmalloc vs allocated in other ways (by libs, etc.). Is there any good reason to see the difference between these?

The INFO fields are used_memory (from zmalloc counter) vs allocator_allocated (stats from allocator). I imagine we can simply let used_memory be identical to allocator_allocated.

@PingXie (Member) commented Jun 13, 2024

I think there are still 2 comments to be addressed:

  1. the hard-coded max thread number
  2. zmalloc_used_memory can be even more inaccurate now with this PR. I kind of like @lipzhu's proposal at #308 (comment), but I think there is a bug in the code snippet: shouldn't used_memory_thread be cleared after getting applied to used_memory? In other words, I think used_memory_thread should be used to track/cache the deltas that haven't been applied to the global counter.

je_mallctl("stats.allocated")

@lipzhu you mentioned that je_malloc_usable_size is very expensive. I am wondering where this call would stand?

@lipzhu (Contributor, Author) commented Jun 13, 2024

I imagine we can simply let used_memory be identical to allocator_allocated.

@zuiderkwast I am not clear on the full history of used_memory, but I saw comments like the one below; is used_memory designed to be different from allocator_allocated? And it seems we only have allocator_allocated with jemalloc now.

    /* Unlike zmalloc_used_memory, this matches the stats.resident by taking
     * into account all allocations done by this process (not only zmalloc). */
    je_mallctl("stats.allocated", allocated, &sz, NULL, 0);

But I noticed that the metrics used_memory and allocator_allocated are very close during the benchmark test.

info memory
used_memory:46356571552
allocator_allocated:46356604048

the hard-coded max thread number

@PingXie Yes, I am also a little confused here, given the suggestion in #308 (comment). How do we handle modules? Is there any way to get the number of module threads to factor into the max thread number?

zmalloc_used_memory can be even more inaccurate now with the PR.

Can you be more specific?

I kind of like @lipzhu's proposal at #308 (comment) but I think there is a bug in the code snippet - shouldn't used_memory_thread be cleared after getting applied to used_memory? In other words, I think used_memory_thread should be used to track/cache the deltas that haven't been applied to the global counter.

I am OK with any proposal if the core team makes the final decision.

@lipzhu you mentioned that je_malloc_usable_size is very expensive. I am wondering where this call would stand?

Each call to zmalloc also calls zmalloc_size to get the allocated size, and I observed je_malloc_usable_size taking about 6% of CPU cycles.

@zuiderkwast (Contributor)

Thanks for putting some focus on this performance bottleneck!

I commented in the issue #467.

I think we can use jemalloc's stats and fall back to counting in zmalloc only if another allocator is used.

@PingXie (Member) commented Jun 13, 2024

Each call of zmalloc will also call zmalloc_size to get the allocated size. And I observed the je_malloc_usable_size take about 6% CPU cycles.

Sorry, I meant to ask whether you had a chance to evaluate the overhead of je_mallctl("stats.allocated").

I think we can use jemalloc's stats and fallback to counting in zmalloc only if another allocator is used.

I think we should get a perf reading of this change first.

Thanks for putting some focus on this performance bottleneck!

+1. Appreciate it, @lipzhu!

@lipzhu (Contributor, Author) commented Jun 14, 2024

@zuiderkwast @PingXie I updated the patch as suggested in #467.


Signed-off-by: Lipeng Zhu <lipeng.zhu@intel.com>
@zuiderkwast (Contributor) left a comment

LGTM (with some nit)

It's almost the code I posted in a comment. :) It was more fun to write code than to explain in text. Sorry for that.

@lipzhu lipzhu changed the title Introduce Thread-local storage variable to replace atomic when update used_memory metrics to reduce contention. Introduce Thread-local storage variable to reduce atomic contention when update used_memory metrics to. Jun 14, 2024
@zuiderkwast (Contributor)

It seems we have a problem on macos:

zmalloc.c:55:10: fatal error: 'threads.h' file not found
#include <threads.h>
         ^~~~~~~~~~~

See https://stackoverflow.com/questions/16244153/clang-c11-threads-h-not-found

Maybe we need some conditional compilation like

#if __STDC_NO_THREADS__
#define thread_local __thread
#else
#include <threads.h>
#endif

Signed-off-by: Lipeng Zhu <lipeng.zhu@intel.com>
@lipzhu lipzhu changed the title Introduce Thread-local storage variable to reduce atomic contention when update used_memory metrics to. Introduce Thread-local storage variable to reduce atomic contention when update used_memory metrics. Jun 14, 2024
Signed-off-by: Lipeng Zhu <lipeng.zhu@intel.com>
@zuiderkwast (Contributor) left a comment

LGTM.

There's a "major-decision-pending" label on the issue, so let's wait for a majority of the @valkey-io/core-team (vote or approve) before we merge this.

@zuiderkwast zuiderkwast linked an issue Jun 14, 2024 that may be closed by this pull request
@madolson madolson added major-decision-approved Major decision approved by TSC team release-notes This issue should get a line item in the release notes labels Jun 17, 2024
@madolson madolson requested a review from soloestoy June 17, 2024 15:10
@soloestoy (Member)

I have some concerns:

  1. Now the threshold is 1MB; if this threshold is not reached, used_memory is not updated. In some special cases (for example lazyfree, where bio only frees 900KB of memory and then no new jobs come), used_memory will stay inaccurate forever. I think we need a compensation mechanism to achieve eventual consistency: not only size thresholds but also time thresholds, such as flushing the delta every 1 second.
  2. We have many threads: main thread, bio, io-threads, module threads. Too many threads may lead to larger errors; in the situation from point 1, the error is the number of threads times 1MB. If the problem in point 1 is resolved, this is not an issue either.

@zuiderkwast (Contributor)

  1. also time thresholds, such as updating the delta every 1 second

@soloestoy Isn't checking time in the allocator function too slow? It is a very hot code path.

Some alternative ideas:

  • Use 100KB or 10KB instead of 1MB to make the error smaller? Most allocations are smaller than 1KB anyway.
  • Add some explicit call to flush this delta to the atomic variable, such as after each job in bio.c and IO threads.

@soloestoy (Member)

Isn't checking time in the allocator function too slow? It is a very hot code path.

yes, it would be best if every thread could register a timer that triggers a delta update every second.

Use 100KB or 10KB instead of 1M to make the error smaller? Most of the allocations are small like less than 1K anyway.

maybe 10KB is an acceptable error.

Add some explicit call to flush this delta to the atomic variable, such as after each job in bio.c and IO threads.

native threads can work well, but module threads are a problem.

@zuiderkwast (Contributor) commented Jun 19, 2024

it would be best if every thread can register a timer to trigger updating delta every second.

@soloestoy How to do this per thread in a simple way? Signal handler?

maybe 10KB is an acceptable error.

Let's do this then? I think it can provide much of the benefit already.

native threads can work well, but module's thread is a problem.

Simple solution: we can do an explicit flush to the atomic counter in ValkeyModule_Alloc and ValkeyModule_Free.

@lipzhu (Contributor, Author) commented Jun 19, 2024

@soloestoy The PR is based on the assumption that we can accept an inaccurate used_memory; the worst-case gap is, as you mentioned, threads_num * 1MB. You can refer to the discussion between @PingXie and @zuiderkwast in #467 for more details.

If we are not aligned on needing an absolutely accurate used_memory, I want to recall another proposal (actually the initial version of this PR): use thread-local storage to maintain each thread's own used_memory and aggregate them in zmalloc_used_memory(). It is a bit like the opposite of a rwlock: write in parallel, read exclusively. I created #674 to demonstrate the proposal for the core members to review and compare.

@lipzhu lipzhu closed this Jul 2, 2024
@lipzhu lipzhu deleted the atomic branch August 28, 2024 07:36
Labels: major-decision-approved, performance, release-notes, to-be-merged
Development

Successfully merging this pull request may close these issues.

Significance of absolute accurate used_memory?
7 participants