
Introduce Thread-local storage variable to reduce atomic contention when update used_memory metrics. #308

Closed · 3 commits

Conversation


@lipzhu lipzhu commented Apr 12, 2024

Description

This patch introduces a thread-local storage (TLS) variable to replace the atomic counter in zmalloc and reduce unnecessary contention.

Problem Statement

The zmalloc and zfree family of functions update used_memory on every operation, and they are called very frequently. In the benchmark memtier_benchmark-1Mkeys-load-stream-5-fields-with-100B-values-pipeline-10.yml, the cycle share of zmalloc and zfree is high; as thin wrappers around the underlying allocator library, they should not consume that many cycles. Most of those cycles come from lock add and lock sub, which are expensive instructions. Profiling shows the metric updates mainly come from the main thread, so using a TLS variable removes a lot of contention.

[profiling screenshots: cycle breakdown of zmalloc/zfree]

Performance Impact

Test Env
  • OS: CentOS Stream 8
  • Kernel: 6.2.0
  • Platform: Intel Xeon Platinum 8380
  • Server and Client in same socket
Start Server

taskset -c 0 ~/valkey/src/valkey-server /tmp/valkey_1.conf

Contents of /tmp/valkey_1.conf:

port 9001
bind * -::*
daemonize yes
protected-mode no
save ""

Using the benchmark memtier_benchmark-1Mkeys-load-stream-5-fields-with-100B-values-pipeline-10.yml:
memtier_benchmark -s 127.0.0.1 -p 9001 "--pipeline" "10" "--data-size" "100" --command "XADD __key__ MAXLEN ~ 1 * field __data__" --command-key-pattern="P" --key-minimum=1 --key-maximum 1000000 --test-time 180 -c 50 -t 4 --hide-histogram

We can observe more than 6% QPS gain.

For the other SET/GET benchmarks, using commands like:
taskset -c 6-9 ~/valkey/src/valkey-benchmark -p 9001 -t set,get -d 100 -r 1000000 -n 1000000 -c 50 --threads 4

No performance gain or regression was observed.

With pipelining enabled, I observed a ~4% performance gain with this test case:
taskset -c 4-7 memtier_benchmark -s 127.0.0.1 -p 9001 "--pipeline" "10" "--data-size" "100" --ratio 1:0 --key-pattern P:P --key-minimum=1 --key-maximum 1000000 --test-time 180 -c 50 -t 4 --hide-histogram

@lipzhu lipzhu changed the title Introduce TLS variable to replace atomic when update used_memory metrics to reduce contention. Introduce Thread-local storage variable to replace atomic when update used_memory metrics to reduce contention. Apr 12, 2024
@madolson (Member) left a comment

What are the server configurations you are using for the test? I didn't see them listed. Can you also just do a simple test with get/set?

@lipzhu (Contributor, Author) commented Apr 15, 2024

What are the server configurations you are using for the test? I didn't see them listed. Can you also just do a simple test with get/set?

Just updated the server config and the SET/GET results in the top comment.

@lipzhu (Contributor, Author) commented Apr 17, 2024

@valkey-io/core-team, could you help to take a look at this patch?

@enjoy-binbin (Member) left a comment

I did not take a deep look; how do we handle module threads?

@lipzhu (Contributor, Author) commented Apr 22, 2024

Hi @enjoy-binbin, thanks for your comments on this patch. BTW, your blog on yuque helped me a lot in understanding redis/valkey.

i did not take a deep look, how do we handle module threads?

Sorry, I missed the module threads. Maybe I can remove the explicit call to zmalloc_register_thread_index in the start routine, initialize static __thread int thread_index = -1, and then add a check if (thread_index == -1) zmalloc_register_thread_index(); in zmalloc/zfree?

codecov bot commented Apr 22, 2024

Codecov Report

All modified and coverable lines are covered by tests ✅

Project coverage is 70.19%. Comparing base (b546dd2) to head (aeb8159).
Report is 12 commits behind head on unstable.

Additional details and impacted files
@@             Coverage Diff              @@
##           unstable     #308      +/-   ##
============================================
- Coverage     70.22%   70.19%   -0.03%     
============================================
  Files           110      110              
  Lines         60039    60066      +27     
============================================
+ Hits          42163    42165       +2     
- Misses        17876    17901      +25     
Files Coverage Δ
src/evict.c 97.34% <100.00%> (ø)
src/zmalloc.c 85.44% <100.00%> (+0.80%) ⬆️

... and 14 files with indirect coverage changes

@lipzhu (Contributor, Author) commented Apr 28, 2024

@enjoy-binbin @PingXie @madolson Thoughts about this patch?

@lipzhu (Contributor, Author) commented May 31, 2024

Kindly ping @valkey-io/core-team.

@zuiderkwast (Contributor) left a comment

This one seems almost ready to merge.

@PingXie @enjoy-binbin @madolson are there any open questions and suggestions how we can close them?

@zuiderkwast zuiderkwast added the to-be-merged Almost ready to merge label Jun 13, 2024
@zuiderkwast (Contributor)

In the Valkey Contributor Summit, someone (who?) suggested that we remove this counter in zmalloc and instead rely on the stats from the allocator, like je_mallctl("stats.allocated"). I tend to agree with that. zmalloc is in a hot code path and it'd be good to avoid tracking this. The only difference seems to be that we can see how much was allocated using zmalloc vs allocated in other ways (by libs, etc.). Is there any good reason to see the difference between these?

The INFO fields are used_memory (from zmalloc counter) vs allocator_allocated (stats from allocator). I imagine we can simply let used_memory be identical to allocator_allocated.

@PingXie (Member) commented Jun 13, 2024

I think there are still 2 comments to be addressed:

  1. the hard-coded max thread number
  2. zmalloc_used_memory can be even more inaccurate now with this PR. I kind of like @lipzhu's proposal at #308 (comment), but I think there is a bug in the code snippet: shouldn't used_memory_thread be cleared after getting applied to used_memory? In other words, I think used_memory_thread should be used to track/cache the deltas that haven't been applied to the global counter.

je_mallctl("stats.allocated")

@lipzhu you mentioned that je_malloc_usable_size is very expensive. I am wondering where this call would stand?

@lipzhu (Contributor, Author) commented Jun 13, 2024

I imagine we can simply let used_memory be identical to allocator_allocated.

@zuiderkwast I am not clear on the full history of used_memory, but I saw comments like the one below; is used_memory designed to be different from allocator_allocated? And it seems we only have allocator_allocated with jemalloc now.

    /* Unlike zmalloc_used_memory, this matches the stats.resident by taking
     * into account all allocations done by this process (not only zmalloc). */
    je_mallctl("stats.allocated", allocated, &sz, NULL, 0);

But I noticed that the metrics used_memory and allocator_allocated are very close during the benchmark test.

info memory
used_memory:46356571552
allocator_allocated:46356604048

the hard-coded max thread number

@PingXie Yes, I am also a little confused here, given the suggestion in #308 (comment). How do we handle modules? Is there any way to get the number of module threads to factor into the max thread number?

zmalloc_used_memory can be even more inaccurate now with the PR.

Can you be more specific?

I kind of like @lipzhu's proposal at #308 (comment) but I think there is a bug in the code snippet - shouldn't used_memory_thread be cleared after getting applied to used_memory? In other words, I think used_memory_thread should be used to track/cache the deltas that haven't been applied to the global counter.

I am OK with any proposal if the core team makes the final decision.

@lipzhu you mentioned that je_malloc_usable_size is very expensive. I am wondering where this call would stand?

Each call to zmalloc also calls zmalloc_size to get the allocated size, and I observed je_malloc_usable_size taking about 6% of CPU cycles.

@zuiderkwast (Contributor)

Thanks for putting some focus on this performance bottleneck!

I commented in the issue #467.

I think we can use jemalloc's stats and fall back to counting in zmalloc only if another allocator is used.

@PingXie (Member) commented Jun 13, 2024

Each call of zmalloc will also call zmalloc_size to get the allocated size. And I observed the je_malloc_usable_size take about 6% CPU cycles.

Sorry, I meant to ask whether you had a chance to evaluate the overhead of je_mallctl("stats.allocated").

I think we can use jemalloc's stats and fallback to counting in zmalloc only if another allocator is used.

I think we should get a perf reading of this change first.

Thanks for putting some focus on this performance bottleneck!

+1. Appreciate it, @lipzhu!

@lipzhu (Contributor, Author) commented Jun 14, 2024

@zuiderkwast @PingXie I updated the patch as suggested in #467.


Signed-off-by: Lipeng Zhu <lipeng.zhu@intel.com>
@zuiderkwast (Contributor) left a comment

LGTM (with some nit)

It's almost the code I posted in a comment. :) It was more fun to write code than to explain in text. Sorry for that.

@lipzhu lipzhu changed the title Introduce Thread-local storage variable to replace atomic when update used_memory metrics to reduce contention. Introduce Thread-local storage variable to reduce atomic contention when update used_memory metrics to. Jun 14, 2024
@zuiderkwast (Contributor)

It seems we have a problem on macos:

zmalloc.c:55:10: fatal error: 'threads.h' file not found
#include <threads.h>
         ^~~~~~~~~~~

See https://stackoverflow.com/questions/16244153/clang-c11-threads-h-not-found

Maybe we need some conditional compilation like

#if __STDC_NO_THREADS__
#define thread_local __thread
#else
#include <threads.h>
#endif

Signed-off-by: Lipeng Zhu <lipeng.zhu@intel.com>
@lipzhu lipzhu changed the title Introduce Thread-local storage variable to reduce atomic contention when update used_memory metrics to. Introduce Thread-local storage variable to reduce atomic contention when update used_memory metrics. Jun 14, 2024
Signed-off-by: Lipeng Zhu <lipeng.zhu@intel.com>
@zuiderkwast (Contributor) left a comment

LGTM.

There's a "major-decision-pending" label on the issue, so let's wait for a majority of the @valkey-io/core-team (vote or approve) before we merge this.

@zuiderkwast zuiderkwast linked an issue Jun 14, 2024 that may be closed by this pull request
@madolson madolson added major-decision-approved Major decision approved by TSC team release-notes This issue should get a line item in the release notes labels Jun 17, 2024
@madolson madolson requested a review from soloestoy June 17, 2024 15:10
@soloestoy (Member)

I have some concerns:

  1. Now the threshold is 1MB; if this threshold is not reached, used_memory is not updated. In some special cases (for example lazyfree, where bio only frees 900KB of memory and then no new jobs come), used_memory will stay inaccurate forever. I think we need a compensation mechanism to achieve eventual consistency: not only size thresholds but also time thresholds, such as flushing the delta every 1 second.
  2. We have many threads: main thread, bio, io-threads, module threads. Too many threads may lead to larger errors; in the situation from point 1, the error is the number of threads times 1MB. If the problem in point 1 is resolved, this is not an issue either.

@zuiderkwast (Contributor)

  1. also time thresholds, such as updating the delta every 1 second

@soloestoy Isn't checking time in the allocator function too slow? It is a very hot code path.

Some alternative ideas:

  • Use 100KB or 10KB instead of 1MB to make the error smaller? Most allocations are smaller than 1KB anyway.
  • Add some explicit call to flush this delta to the atomic variable, such as after each job in bio.c and IO threads.

@soloestoy (Member)

Isn't checking time in the allocator function too slow? It is a very hot code path.

yes, it would be best if every thread could register a timer that triggers a delta update every second.

Use 100KB or 10KB instead of 1M to make the error smaller? Most of the allocations are small like less than 1K anyway.

maybe 10KB is an acceptable error.

Add some explicit call to flush this delta to the atomic variable, such as after each job in bio.c and IO threads.

native threads can work well, but module threads are a problem.

@zuiderkwast (Contributor) commented Jun 19, 2024

it would be best if every thread can register a timer to trigger updating delta every second.

@soloestoy How to do this per thread in a simple way? Signal handler?

maybe 10KB is an acceptable error.

Let's do this then? I think it can provide much of the benefit already.

native threads can work well, but module's thread is a problem.

Simple solution: we can do an explicit flush to the atomic counter in ValkeyModule_Alloc and ValkeyModule_Free.

@lipzhu (Contributor, Author) commented Jun 19, 2024

@soloestoy The PR is based on the assumption that we can accept an inaccurate used_memory; the worst-case gap is, as you mentioned, threads_num * 1MB. You can refer to the discussion between @PingXie and @zuiderkwast in #467 for more details.

If we are not aligned on needing an absolutely accurate used_memory, I want to recall another proposal (actually the initial version of this PR): use thread-local storage to maintain each thread's own used_memory and aggregate them in zmalloc_used_memory(). It is a bit like the opposite of a rwlock: write in parallel, read exclusively. I created #674 to demonstrate the proposal for the core members to review and compare.

@lipzhu lipzhu closed this Jul 2, 2024
@lipzhu lipzhu deleted the atomic branch August 28, 2024 07:36
Labels: major-decision-approved, performance, release-notes, to-be-merged
Development

Successfully merging this pull request may close these issues.

Significance of absolute accurate used_memory?
7 participants