
Support for key locking #3

Open
madolson opened this issue Mar 22, 2024 · 5 comments

@madolson
Member

There are several features on our roadmap that assume we will develop some form of locking on an individual key. I think there is only one commonly accepted approach to achieving this, which is to mark a key as "in use" and block commands that attempt to mutate that key. Commands may take either a "shared lock", which multiple holders can hold on the same key concurrently, or an "exclusive lock", of which only one can be granted at a given time.
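As a rough illustration of the shared/exclusive semantics, here is a minimal sketch (the `keyLock` type and helper names are hypothetical, not actual Valkey code). Since lock decisions would be made on the main thread, a plain counter suffices: blocked commands, not threads, do the waiting.

```c
/* Minimal sketch of per-key shared/exclusive lock state (hypothetical). */
typedef struct keyLock {
    int state; /* 0 = unlocked, >0 = shared holder count, -1 = exclusive */
} keyLock;

/* Try to take a shared lock; fails only if an exclusive lock is held. */
static int keyLockTryShared(keyLock *lk) {
    if (lk->state == -1) return 0; /* exclusively held: caller must block */
    lk->state++;
    return 1;
}

/* Try to take an exclusive lock; fails if any holder exists. */
static int keyLockTryExclusive(keyLock *lk) {
    if (lk->state != 0) return 0; /* shared or exclusive holders exist */
    lk->state = -1;
    return 1;
}

static void keyLockRelease(keyLock *lk) {
    if (lk->state == -1) lk->state = 0;  /* exclusive holder released */
    else if (lk->state > 0) lk->state--; /* one shared holder released */
}
```

A real implementation would presumably also need a per-key wait queue so that clients blocked on the key can be retried when it is released.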

This issue is a high-level summary of all the known issues related to key locking. Most of it is derived from redis/redis#7372 (comment), re-posted here:

- Even if the command is not explicitly writing, there are several cases where data is modified while reading. These include PFCOUNT, and also any data structure whose internal format can change (such as a hash table, skiplist, or compressed list encoding that may be mid-rehash or mid-conversion). These modifications must be prevented in both the main and background threads.
- There are significant complexities with MULTI/EXEC. If a MULTI/EXEC hits a blocked key, it must either:
  - synchronously pause the Redis main thread (waiting for the background thread to release the lock) - this also comes with complexity around completing/joining the background command while the main thread is paused;
  - pre-examine the MULTI/EXEC to determine if any locked keys will be used - this comes with the challenge of tracking keys across databases if SWAPDB and/or SELECT is used in the MULTI/EXEC block; or
  - block all MULTI/EXEC commands if any async command is running.
- There are significant challenges with LUA. LUA is not forced to declare the keys it uses. Furthermore, even if a LUA script does declare which keys it touches, the declaration does not indicate which DB each key is associated with, and there is no mechanism to determine keys by inspection (as is possible with MULTI/EXEC). The safest option for LUA might be to block any LUA script while any async command is running. Throwing an error is not an option, as it breaks atomicity guarantees.
- There is a conflict with FLUSHDB/FLUSHALL. These keyless commands would need to be blocked until no background thread is running.
- There is the possibility of interaction with EXPIRE and EVICT processing. A deep dive is necessary here to understand the issues with each.
- Client disconnect (on the main thread) needs to interact with async processing in a background thread.
- BRPOPLPUSH is a special can of worms and needs to be carefully considered. Consider BRPOPLPUSH FROM TO. Later, while a background thread has "TO" locked, another client performs an LPUSH FROM. The LPUSH (to a different key) would need to be blocked, OR the BRPOPLPUSH would need to be further blocked.
- Should MONITOR show the executed command at the beginning of processing, or at the end?
- I know you're currently looking at background reads, but if the slow command needs to perform a write, the logical lock must also block readers on the main thread, not just writers.
- If the background command is a write command, unlocking of the key needs to be coordinated with replication.
In addition, an important part of this proposal involves the blocking of clients until a key is unlocked. The existing blocking mechanism (for blocking commands like BRPOPLPUSH) is a good candidate for re-work at this juncture.

The existing blocking mechanism works as follows:

1. The blocking command is processed normally through processCommand().
2. In the command's handler, the command is removed from the client's normal cmd/argv/argc fields and copied over to custom fields for the command. (The normal fields are cleared.)
3. When the command is unblocked, a completely separate code path (independent of processCommand()) is used to finish execution of the blocking command.

Not only does the separate code path require independent maintenance, it can also lead to subtle differences in how commands behave when blocked. (Hint: IIRC, MONITORing a BRPOPLPUSH looks subtly different depending on whether it blocks or not.)
Instead, changing the blocking mechanism as follows will better support this new use case (and potentially others):

- When a client is blocked (for any reason), leave cmd/argv/argc attached to the client as normal.
- When the client is unblocked, pass it back through processCommand() to execute as normal. This might result in the client being re-blocked for another reason.
- Some care must be taken to ensure that MONITOR works as expected and only logs the first attempt.
- This approach eliminates a bunch of blocking-command fields on the client struct, eliminates the separate code path for finishing these commands, and provides a standard approach to blocking clients.
- Note that some adjustment of the timeout might be necessary when re-blocking.
Example: A BRPOPLPUSH is blocked waiting on something to pop. An LPUSH is performed, which unblocks the BRPOPLPUSH. We attempt to execute the BRPOPLPUSH normally by sending it through processCommand(). However, it is discovered that the destination key is locked, so the BRPOPLPUSH client is re-blocked. Later, when the target key is unlocked, we again send the BRPOPLPUSH through processCommand(), and this time it succeeds.

Example: “MSET k1 v1 k2 v2” is performed. It is recognized that k1 is locked, so the client is blocked. When k1 is unlocked, the MSET command is passed back through processCommand(). It is determined that k2 is now locked, and the client is blocked a second time. When k2 is unlocked, the command is sent through processCommand() a third time, and may execute normally.
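As a toy illustration of this re-blocking loop, here is a self-contained sketch of the MSET scenario (the structures and helper names below are made up for illustration, not Valkey internals): each unlock simply re-runs the whole command through the same path, which either blocks on the next locked key or finally executes.

```c
#include <stdio.h>

#define NKEYS 4
static int key_locked[NKEYS]; /* toy lock table: 1 = locked */

typedef struct client {
    const int *keys; /* keys the pending command touches (stays attached) */
    int nkeys;
    int blocked_on;  /* key the client is waiting on, or -1 */
} client;

/* One pass through "processCommand": block on the first locked key,
 * otherwise execute. Mirrors the MSET k1 v1 k2 v2 example above. */
static void process_command(client *c) {
    for (int i = 0; i < c->nkeys; i++) {
        if (key_locked[c->keys[i]]) {
            c->blocked_on = c->keys[i];
            printf("client blocked on key %d\n", c->keys[i]);
            return;
        }
    }
    c->blocked_on = -1;
    printf("command executed\n");
}

int main(void) {
    int keys[] = {1, 2};
    client c = {keys, 2, -1};
    key_locked[1] = key_locked[2] = 1;

    process_command(&c); /* blocks on k1 */
    key_locked[1] = 0;
    process_command(&c); /* re-run: now blocks on k2 */
    key_locked[2] = 0;
    process_command(&c); /* re-run: executes normally */
    return 0;
}
```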

Old ref: redis/redis#11396

@zuiderkwast
Contributor

Will key locking allow threaded command execution? If yes, then I think we'll need this for sure, though there's a lot more to consider. Perhaps we should have a global lock that needs to be taken for changing anything else, like config, ACL, etc.

@madolson
Member Author

We may not need it for multi-threaded execution. You can use less granular locks, such as per hash-table bucket, which also enable multi-threaded execution.

@daniel-house
Member

Perhaps locking on slots? Less granular than hash-table buckets, but still supporting more than enough threads in the dictionary at the same time. It looks like a good fit for the way cluster mode denies multi-slot commands. This might also minimize the problems with LUA and MULTI/EXEC.

@zuiderkwast
Contributor

For cluster, we could do one mutex per cluster slot, but in a standalone instance the keys are not stored per slot. We could consider storing them per slot anyway, though computing the slot with CRC16 costs some extra CPU cycles. Alternatively, we could take some bits off the hash function already used for the hash table and index into an array of some mutexes. If we do a lock per key or per hash-table bucket, a mutex takes too much memory, but we could use a spinlock.
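A minimal sketch of that "array of some mutexes" idea (lock striping); the names and sizes below are illustrative, not from the Valkey codebase. The stripe is chosen from bits of the hash the dictionary already computed, so no extra CRC16 pass is needed.

```c
#include <pthread.h>
#include <stdint.h>

#define LOCK_STRIPES 64 /* power of two so masking works */
static pthread_mutex_t stripes[LOCK_STRIPES];

void stripesInit(void) {
    for (int i = 0; i < LOCK_STRIPES; i++)
        pthread_mutex_init(&stripes[i], NULL);
}

static pthread_mutex_t *stripeForHash(uint64_t hash) {
    /* Reuse high bits of the existing table hash to pick a stripe. */
    return &stripes[(hash >> 56) & (LOCK_STRIPES - 1)];
}

/* Run an update under the stripe lock covering this key's hash. */
void lockedUpdate(uint64_t keyhash, void (*update)(void *), void *arg) {
    pthread_mutex_t *m = stripeForHash(keyhash);
    pthread_mutex_lock(m);
    update(arg);
    pthread_mutex_unlock(m);
}
```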

I want to look into a more modern hash table implementation: one with open addressing that can use vector instructions (like AVX2) and is more cache-friendly. If we do that, perhaps there is a logical place to put a lock.

@zuiderkwast
Contributor

We may want to take a look at Dragonfly too. From its README:

To provide atomicity guarantees for multi-key operations, we use the advancements from recent academic research. We chose the paper "VLL: a lock manager redesign for main memory database systems" to develop the transactional framework for Dragonfly. The choice of shared-nothing architecture and VLL allowed us to compose atomic multi-key operations without using mutexes or spinlocks. This was a major milestone for our PoC and its performance stood out from other commercial and open-source solutions.

naglera pushed a commit to naglera/placeholderkv that referenced this issue Apr 8, 2024
…is missed cases to redis-server. (#12322)

Observed that the sanitizer reported a memory leak, as cleanup is not done before process termination in the following negative cases:

**- when we pass '--invalid' as an option to redis-server.**

```
vm:~/mem-leak-issue/redis$ ./src/redis-server --invalid

*** FATAL CONFIG FILE ERROR (Redis 255.255.255) ***
Reading the configuration file, at line 2
>>> 'invalid'
Bad directive or wrong number of arguments

=================================================================
==865778==ERROR: LeakSanitizer: detected memory leaks

Direct leak of 8 byte(s) in 1 object(s) allocated from:
    #0 0x7f0985f65867 in __interceptor_malloc ../../../../src/libsanitizer/asan/asan_malloc_linux.cpp:145
    #1 0x558ec86686ec in ztrymalloc_usable_internal /home/ubuntu/mem-leak-issue/redis/src/zmalloc.c:117
    #2 0x558ec86686ec in ztrymalloc_usable /home/ubuntu/mem-leak-issue/redis/src/zmalloc.c:135
    #3 0x558ec86686ec in ztryrealloc_usable_internal /home/ubuntu/mem-leak-issue/redis/src/zmalloc.c:276
    #4 0x558ec86686ec in zrealloc /home/ubuntu/mem-leak-issue/redis/src/zmalloc.c:327
    #5 0x558ec865dd7e in sdssplitargs /home/ubuntu/mem-leak-issue/redis/src/sds.c:1172
    #6 0x558ec87a1be7 in loadServerConfigFromString /home/ubuntu/mem-leak-issue/redis/src/config.c:472
    #7 0x558ec87a13b3 in loadServerConfig /home/ubuntu/mem-leak-issue/redis/src/config.c:718
    #8 0x558ec85e6f15 in main /home/ubuntu/mem-leak-issue/redis/src/server.c:7258
    #9 0x7f09856e5d8f in __libc_start_call_main ../sysdeps/nptl/libc_start_call_main.h:58

SUMMARY: AddressSanitizer: 8 byte(s) leaked in 1 allocation(s).

```

**- when we pass '--port' as an option but omit the port number.**

```
vm:~/mem-leak-issue/redis$ ./src/redis-server --port

*** FATAL CONFIG FILE ERROR (Redis 255.255.255) ***
Reading the configuration file, at line 2
>>> 'port'
wrong number of arguments

=================================================================
==865846==ERROR: LeakSanitizer: detected memory leaks

Direct leak of 8 byte(s) in 1 object(s) allocated from:
    #0 0x7fdcdbb1f867 in __interceptor_malloc ../../../../src/libsanitizer/asan/asan_malloc_linux.cpp:145
    #1 0x557e8b04f6ec in ztrymalloc_usable_internal /home/ubuntu/mem-leak-issue/redis/src/zmalloc.c:117
    #2 0x557e8b04f6ec in ztrymalloc_usable /home/ubuntu/mem-leak-issue/redis/src/zmalloc.c:135
    #3 0x557e8b04f6ec in ztryrealloc_usable_internal /home/ubuntu/mem-leak-issue/redis/src/zmalloc.c:276
    #4 0x557e8b04f6ec in zrealloc /home/ubuntu/mem-leak-issue/redis/src/zmalloc.c:327
    #5 0x557e8b044d7e in sdssplitargs /home/ubuntu/mem-leak-issue/redis/src/sds.c:1172
    #6 0x557e8b188be7 in loadServerConfigFromString /home/ubuntu/mem-leak-issue/redis/src/config.c:472
    #7 0x557e8b1883b3 in loadServerConfig /home/ubuntu/mem-leak-issue/redis/src/config.c:718
    #8 0x557e8afcdf15 in main /home/ubuntu/mem-leak-issue/redis/src/server.c:7258
    #9 0x7fdcdb29fd8f in __libc_start_call_main ../sysdeps/nptl/libc_start_call_main.h:58

Indirect leak of 10 byte(s) in 1 object(s) allocated from:
    #0 0x7fdcdbb1fc18 in __interceptor_realloc ../../../../src/libsanitizer/asan/asan_malloc_linux.cpp:164
    #1 0x557e8b04f9aa in ztryrealloc_usable_internal /home/ubuntu/mem-leak-issue/redis/src/zmalloc.c:287
    #2 0x557e8b04f9aa in ztryrealloc_usable /home/ubuntu/mem-leak-issue/redis/src/zmalloc.c:317
    #3 0x557e8b04f9aa in zrealloc_usable /home/ubuntu/mem-leak-issue/redis/src/zmalloc.c:342
    #4 0x557e8b033f90 in _sdsMakeRoomFor /home/ubuntu/mem-leak-issue/redis/src/sds.c:271
    #5 0x557e8b033f90 in sdsMakeRoomFor /home/ubuntu/mem-leak-issue/redis/src/sds.c:295
    #6 0x557e8b033f90 in sdscatlen /home/ubuntu/mem-leak-issue/redis/src/sds.c:486
    #7 0x557e8b044e1f in sdssplitargs /home/ubuntu/mem-leak-issue/redis/src/sds.c:1165
    #8 0x557e8b188be7 in loadServerConfigFromString /home/ubuntu/mem-leak-issue/redis/src/config.c:472
    #9 0x557e8b1883b3 in loadServerConfig /home/ubuntu/mem-leak-issue/redis/src/config.c:718
    #10 0x557e8afcdf15 in main /home/ubuntu/mem-leak-issue/redis/src/server.c:7258
    #11 0x7fdcdb29fd8f in __libc_start_call_main ../sysdeps/nptl/libc_start_call_main.h:58

SUMMARY: AddressSanitizer: 18 byte(s) leaked in 2 allocation(s).

```

As part of the analysis, it was found that sdsfreesplitres() is not called when these condition checks are hit.
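For illustration, here is a self-contained analog of the leak pattern (the real code uses sdssplitargs()/sdsfreesplitres() in config.c; the names below are toy stand-ins): the split result must be freed on the error path as well, not only on success.

```c
#include <stdlib.h>
#include <string.h>

/* Toy stand-ins for sdssplitargs()/sdsfreesplitres(). */
static char **split_args(const char *line, int *argc) {
    char **v = malloc(sizeof(char *));
    v[0] = strdup(line);
    *argc = 1;
    return v;
}

static void free_split(char **argv, int argc) {
    for (int i = 0; i < argc; i++) free(argv[i]);
    free(argv);
}

/* Shape of the fix: free the split vector before bailing out on a
 * "bad directive" / "wrong number of arguments" error. */
int load_directive(const char *line, int required_args) {
    int argc;
    char **argv = split_args(line, &argc);
    if (argc < required_args) {
        free_split(argv, argc); /* previously missing: leaked on error */
        return -1;
    }
    /* ... apply the directive ... */
    free_split(argv, argc);
    return 0;
}
```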

Output after the fix:


```
vm:~/mem-leak-issue/redis$ ./src/redis-server --invalid

*** FATAL CONFIG FILE ERROR (Redis 255.255.255) ***
Reading the configuration file, at line 2
>>> 'invalid'
Bad directive or wrong number of arguments
vm:~/mem-leak-issue/redis$

===========================================
vm:~/mem-leak-issue/redis$ ./src/redis-server --jdhg

*** FATAL CONFIG FILE ERROR (Redis 255.255.255) ***
Reading the configuration file, at line 2
>>> 'jdhg'
Bad directive or wrong number of arguments

---------------------------------------------------------------------------
vm:~/mem-leak-issue/redis$ ./src/redis-server --port

*** FATAL CONFIG FILE ERROR (Redis 255.255.255) ***
Reading the configuration file, at line 2
>>> 'port'
wrong number of arguments
```

Co-authored-by: Oran Agra <oran@redislabs.com>
naglera pushed a commit to naglera/placeholderkv that referenced this issue Apr 8, 2024
## Issues and solutions from #12817
1. Touching ProcessingEventsWhileBlocked and calling moduleCount() without
the GIL in afterSleep()
    - Introduced: 
       Version: 7.0.0
       PR: #9963

   - Harm Level: Very High
If the module thread calls `RM_Yield()` before the main thread enters
afterSleep(),
and modifies `ProcessingEventsWhileBlocked`(+1), it will cause the main
thread to not wait for GIL,
which can lead to all kinds of unforeseen problems, including memory
data corruption.

   - Initial / Abandoned Solution:
      * Added `__thread` specifier for ProcessingEventsWhileBlocked.
`ProcessingEventsWhileBlocked` is used to protect against nested event
processing, but event processing
in the main thread and module threads should be completely independent
and unaffected, so it is safer
         to use TLS.
* Adding a cached module count to keep track of the current number of
modules, to avoid having to use `dictSize()`.
    
    - Related Warnings:
```
WARNING: ThreadSanitizer: data race (pid=1136)
  Write of size 4 at 0x0001045990c0 by thread T4 (mutexes: write M0):
    #0 processEventsWhileBlocked networking.c:4135 (redis-server:arm64+0x10006d124)
    #1 RM_Yield module.c:2410 (redis-server:arm64+0x10018b66c)
    #2 bg_call_worker <null>:83232836 (blockedclient.so:arm64+0x16a8)

  Previous read of size 4 at 0x0001045990c0 by main thread:
    #0 afterSleep server.c:1861 (redis-server:arm64+0x100024f98)
    #1 aeProcessEvents ae.c:408 (redis-server:arm64+0x10000fd64)
    #2 aeMain ae.c:496 (redis-server:arm64+0x100010f0c)
    #3 main server.c:7220 (redis-server:arm64+0x10003f38c)
```

2. aeApiPoll() is not thread-safe
When using RM_Yield to handle events in a module thread, if the main
thread has not yet
entered `afterSleep()`, both the module thread and the main thread may
touch `server.el` at the same time.

    - Introduced: 
       Version: 7.0.0
       PR: #9963

   - Old / Abandoned Solution:
Adding a new mutex to protect the window between beforeSleep() and
afterSleep().
Defect: If the main thread enters the ae loop without any IO events, it
will wait until
the next timeout or until there is any event again, and the module
thread will
always hang until the main thread leaves the event loop.

    - Related Warnings:
```
SUMMARY: ThreadSanitizer: data race ae_kqueue.c:55 in addEventMask
==================
==================
WARNING: ThreadSanitizer: data race (pid=14682)
  Write of size 4 at 0x000100b54000 by thread T9 (mutexes: write M0):
    #0 aeApiPoll ae_kqueue.c:175 (redis-server:arm64+0x100010588)
    #1 aeProcessEvents ae.c:399 (redis-server:arm64+0x10000fb84)
    #2 processEventsWhileBlocked networking.c:4138 (redis-server:arm64+0x10006d3c4)
    #3 RM_Yield module.c:2410 (redis-server:arm64+0x10018b66c)
    #4 bg_call_worker <null>:16042052 (blockedclient.so:arm64+0x169c)

  Previous write of size 4 at 0x000100b54000 by main thread:
    #0 aeApiPoll ae_kqueue.c:175 (redis-server:arm64+0x100010588)
    #1 aeProcessEvents ae.c:399 (redis-server:arm64+0x10000fb84)
    #2 aeMain ae.c:496 (redis-server:arm64+0x100010da8)
    #3 main server.c:7238 (redis-server:arm64+0x10003f51c)
```

## The final fix, as described in the comments:
redis/redis#12817 (comment)
Optimized solution based on the above comment:

First, we add `module_gil_acquring` to indicate whether the main thread
is currently acquiring the GIL.

When the module thread starts to yield, there are two possibilities (we
assume the caller keeps the GIL):
1. The main thread is between beforeSleep() and afterSleep(), that is,
`module_gil_acquring` is not 1 now. At this point, the module thread
wakes up the main thread through the pipe and leaves the yield, waiting
for the next yield, when the main thread may already be in the
acquiring-GIL state.
    
2. The main thread is in the acquiring-GIL state. The module thread
releases the GIL, yielding the CPU to give the main thread an
opportunity to start event processing, and then reacquires the GIL
(blocking until the main thread releases it). This is the direction
mentioned in redis/redis#12817 (comment).
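A rough sketch of those two cases, with stubbed helpers (the shape is hypothetical; the real logic lives around the GIL handling in module.c):

```c
#include <sched.h>
#include <pthread.h>

static int module_gil_acquring; /* set by the main thread while acquiring */
static pthread_mutex_t gil = PTHREAD_MUTEX_INITIALIZER;

static void wake_main_thread_via_pipe(void) { /* write to the wakeup pipe */ }

/* Called from a module thread that holds the GIL and wants to yield. */
void module_yield(void) {
    if (!module_gil_acquring) {
        /* Case 1: the main thread is between beforeSleep() and
         * afterSleep(); nudge it through the pipe and return. */
        wake_main_thread_via_pipe();
    } else {
        /* Case 2: the main thread is blocked acquiring the GIL. Release
         * it, give up the CPU so the main thread can run one event
         * cycle, then block until it hands the GIL back. */
        pthread_mutex_unlock(&gil);
        sched_yield();
        pthread_mutex_lock(&gil);
    }
}
```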

---------

Co-authored-by: Oran Agra <oran@redislabs.com>