
Io thread work offload #763

Merged: 3 commits into valkey-io:unstable on Jul 19, 2024
Conversation


@uriyage uriyage commented Jul 9, 2024

IO-Threads Work Offloading

This PR is the 2nd of 3 PRs intended to achieve the goal of 1M requests per second.
(1st PR: #758)

This PR offloads additional work to the I/O threads, beyond the current read-parse/write operations, to better utilize the I/O threads and reduce the load on the main thread.

It contains the following 3 commits:

Poll Offload

Currently, the main thread is responsible for executing the poll-wait system call, while the IO threads wait for tasks from the main thread. The poll-wait operation is expensive and can consume up to 30% of the main thread's time. We could have let the IO threads do the poll-wait by themselves, with each thread listening to some of the clients and notifying the main thread when a client's command is ready to execute.

However, the current approach, where the main thread listens for events from the network, has several benefits. The main thread remains in charge, allowing it to know the state of each client (idle/read/write/close) at any given time. Additionally, it keeps the threads flexible, enabling us to drain an IO thread's job queue and stop a thread when the load is light without modifying the event loop and moving its clients to a different IO thread. Furthermore, with this approach, the IO threads don't need to wait for both messages from the network and messages from the main thread; instead, they wait only for tasks from the main thread.

To enjoy the benefits of both the main thread remaining in charge and the poll being offloaded, we propose offloading the poll-wait as a one-shot, non-blocking job to one of the IO threads. The IO thread performs a non-blocking poll-wait call while the main thread processes the client commands. Later, in `aeProcessEvents`, instead of sleeping on the poll, we check for the IO thread's poll-wait results.

The poll-wait will be offloaded in `beforeSleep` only when there are ready events for the main thread to process. If no events are pending, the main thread reverts to the current behavior and sleeps on the poll by itself.
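To make the gating concrete, here is a minimal sketch of that decision, assuming hypothetical `aeEventsPending()` and `trySendPollJobToIOThreads()` helpers (both names are illustrative, not the PR's actual code):

```c
/* Illustrative sketch only; the helper names are hypothetical. */
void beforeSleep(struct aeEventLoop *el) {
    /* ... other before-sleep work ... */

    /* Offload the poll only when the main thread already has ready
     * events to process; otherwise it sleeps on the poll itself. */
    if (aeEventsPending(el)) {
        trySendPollJobToIOThreads(el); /* one-shot, non-blocking poll job */
    }
}
```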

Implementation Details

A new callback, `custompoll`, was added to the `aeEventLoop`. When it is not set to `NULL`, ae calls the `custompoll` callback instead of `aeApiPoll`.

When the poll is offloaded, we set `custompoll` to `getIOThreadPollResults` and send a poll job to the thread. The thread takes a mutex and makes a non-blocking call (with timeout 0) to `aePoll`, which populates the fired events array. The IO thread sets `server.io_fired_events` to the returned `numevents`; later, the main thread in `custompoll` returns `server.io_fired_events` and sets `custompoll` back to `NULL`.

To ensure thread safety when accessing `server.el`, all functions that modify the event-loop events are wrapped with a mutex, guaranteeing mutual exclusion when the events are modified.
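Putting the pieces above together, a rough sketch of the two sides could look like this (the mutex field and the exact signatures are assumptions; only `custompoll`, `getIOThreadPollResults`, `aePoll`, and `server.io_fired_events` come from the description):

```c
#include <pthread.h>

/* IO-thread side: run a non-blocking poll and record the result. */
void ioThreadPollJob(aeEventLoop *el) {
    pthread_mutex_lock(&el->poll_mutex);      /* mutex field is assumed */
    struct timeval tv = {0, 0};               /* timeout 0: never block */
    server.io_fired_events = aePoll(el, &tv); /* populates the fired events array */
    pthread_mutex_unlock(&el->poll_mutex);
}

/* Main-thread side: ae calls this instead of aeApiPoll while
 * el->custompoll is set. */
int getIOThreadPollResults(aeEventLoop *el) {
    int numevents = server.io_fired_events;
    el->custompoll = NULL; /* revert to normal polling afterwards */
    return numevents;
}
```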

Command Lookup Offload

As the IO thread parses the command from the client's query buffer (`querybuf`), it can also perform a command lookup in the commands dictionary, which can consume up to ~5% of the main thread's runtime.

Implementation Details
The IO thread will store the looked-up command in the client's new `io_parsed_cmd` field. We can't use `c->cmd` for that, since `c->cmd` is used to check whether a command was reprocessed or not.

To ensure thread safety when accessing the command dictionary, we make sure the main thread isn't changing the dictionary while IO threads are accessing it. This is accomplished by introducing a new flag called `no_incremental_rehash` for the commands `dictType`. When performing `dictResize`, we rehash the entire dictionary in place rather than deferring the process.
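As a sketch under those assumptions, the two sides could look roughly like this (`io_parsed_cmd` and `no_incremental_rehash` come from the PR; the helper names and exact signatures are illustrative):

```c
/* IO thread: after parsing argv from the client's query buffer.
 * Reading the commands dict is safe because no_incremental_rehash
 * guarantees the main thread never leaves it mid-rehash. */
void ioThreadLookupCommand(client *c) {
    c->io_parsed_cmd = lookupCommand(c->argv, c->argc);
}

/* Main thread: prefer the pre-looked-up command, falling back to an
 * inline lookup when the command was parsed on the main thread. */
struct serverCommand *resolveCommand(client *c) {
    if (c->io_parsed_cmd) return c->io_parsed_cmd;
    return lookupCommand(c->argv, c->argc);
}
```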

Free Offload

Since the command arguments are allocated by the I/O thread, it would be beneficial if they were also freed by the same thread. If the main thread frees objects allocated by the I/O thread, two issues arise:

  1. During the freeing process, the main thread needs to access the SDS pointed to by the object to get its length.
  2. With jemalloc, each thread manages a thread-local pool (`tcache`) of buffers for quick reallocation without accessing the arena. If the main thread constantly frees objects allocated by other threads, those threads will have to frequently access the shared arena to obtain new memory allocations.

Implementation Details
When freeing the client's argv, we will send the argv array to the thread that allocated it, identified by the client ID. When freeing an object during `dbOverwrite`, we will offload the object free as well. We will extend this to offload the free during `dbDelete` in a future PR, as its effects on defrag/memory eviction need to be studied.
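A hedged sketch of the argv path, assuming a hypothetical `trySendFreeJobToIOThread()` helper and that the owning thread is derived from the client id (the derivation shown is an assumption, not the PR's exact mapping):

```c
/* Sketch only: the job helper and thread-id derivation are assumptions. */
static void freeClientArgvOffloaded(client *c) {
    /* Each client is pinned to one IO thread, so its id identifies the
     * thread whose jemalloc tcache owns these allocations. */
    int tid = (int)(c->id % server.io_threads_num);
    if (tid == 0 || !trySendFreeJobToIOThread(tid, c->argv, c->argc)) {
        /* Main-thread client, or the job queue is full: free inline. */
        for (int i = 0; i < c->argc; i++) decrRefCount(c->argv[i]);
        zfree(c->argv);
    }
    c->argv = NULL;
    c->argc = 0;
}
```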


codecov bot commented Jul 9, 2024

Codecov Report

Attention: Patch coverage is 38.27160% with 100 lines in your changes missing coverage. Please review.

Project coverage is 70.16%. Comparing base (548b4e0) to head (570e7cb).
Report is 16 commits behind head on unstable.

Additional details and impacted files
@@             Coverage Diff              @@
##           unstable     #763      +/-   ##
============================================
- Coverage     70.17%   70.16%   -0.02%     
============================================
  Files           112      112              
  Lines         60536    61263     +727     
============================================
+ Hits          42483    42986     +503     
- Misses        18053    18277     +224     
| Files | Coverage Δ |
|---|---|
| src/db.c | 88.40% <100.00%> (+<0.01%) ⬆️ |
| src/dict.c | 97.28% <100.00%> (-0.26%) ⬇️ |
| src/networking.c | 88.78% <100.00%> (+0.12%) ⬆️ |
| src/server.c | 88.55% <100.00%> (+<0.01%) ⬆️ |
| src/server.h | 100.00% <ø> (ø) |
| src/module.c | 9.64% <0.00%> (-0.01%) ⬇️ |
| src/ae.c | 74.90% <52.00%> (-5.19%) ⬇️ |
| src/io_threads.c | 7.87% <10.95%> (+1.24%) ⬆️ |

... and 21 files with indirect coverage changes

@zuiderkwast zuiderkwast left a comment


Looks good to me in general.

Free offload adds some complexity. It would be good to know how important that one is for performance.

It seems this PR could have been three separate PRs:

  1. poll offload
  2. command lookup offload
  3. free offload

@uriyage force-pushed the io-thread-work-offload branch 2 times, most recently from e918ea3 to 5d11ff8 on July 12, 2024
@uriyage force-pushed the io-thread-work-offload branch 2 times, most recently from e36db8e to 8b14ebe on July 15, 2024
@ranshid ranshid left a comment


Overall looks fine. Added some small comments/suggestions.

@zuiderkwast zuiderkwast left a comment


Thanks for moving the mutex to AE. I agree it's better.

@madolson madolson left a comment


I have no other comments on top of what Viktor said. I'll kick off some more tests tomorrow, and hopefully we can merge it soon!

Signed-off-by: Uri Yagelnik <uriy@amazon.com>
Signed-off-by: Uri Yagelnik <uriy@amazon.com>
@madolson

*** [err]: benchmark: multi-thread set,get in tests/integration/valkey-benchmark.tcl

This failed twice; it seems like it might be an issue. Going to kick off wider tests now though.

@madolson

All tests: https://github.com/valkey-io/valkey/actions/runs/9985021048
Asan and some tests with all io threads enabled: https://github.com/valkey-io/valkey/actions/runs/9985030583

We expect the second one to have some failures.

Signed-off-by: Uri Yagelnik <uriy@amazon.com>
@uriyage

uriyage commented Jul 18, 2024

> All tests: https://github.com/valkey-io/valkey/actions/runs/9985021048
> Asan and some tests with all io threads enabled: https://github.com/valkey-io/valkey/actions/runs/9985030583
>
> We expect the second one to have some failures.

There were 4 errors in the tests:

  1. Compile error for the `timeval` struct - fixed by using a forward declaration.

  2. Redis-benchmark error when freeing the event loop - a bug in lock/unlock in `aeDeleteEventLoop`. We attempted to unlock the event loop after it was freed. Fixed by removing the lock/unlock from this function, as it is only called by redis-benchmark.

  3. Failures in cluster/slot-migration - reproduced without the IO-threads changes. This appears to be due to flakiness in recent tests.

  4. Tracking test error - a known issue being addressed in "Skip tracking clients OOM test when I/O threads are enabled" #764.

@madolson added the release-notes label (This issue should get a line item in the release notes) on Jul 18, 2024
@madolson

Pending tests to make sure issues are fixed: https://github.com/valkey-io/valkey/actions/runs/9998114971

@madolson

Tests were all green, even though we have some existing failures; going to merge this.

@madolson madolson merged commit 94bc15c into valkey-io:unstable Jul 19, 2024
20 checks passed
hwware pushed a commit to hwware/valkey that referenced this pull request Jul 25, 2024