Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Sort out mess around propagation and MULTI/EXEC #9890

Merged
merged 22 commits into from Dec 22, 2021

Conversation

guybe7
Copy link
Collaborator

@guybe7 guybe7 commented Dec 3, 2021

The mess:
Some parts use alsoPropagate for late propagation, others using an immediate one (propagate()), causing edge cases, ugly/hacky code, and the tendency for bugs

The basic idea is that all commands are propagated via alsoPropagate (i.e. added to a list) and the top-most call() is responsible for going over that list and actually propagating them (and wrapping them in MULTI/EXEC if there's more than one command). This is done in the new function, propagatePendingCommands.

Callers to propagatePendingCommands:

  1. top-most call() (we want all nested call()s to add to the also_propagate array and just the top-most one to propagate them) - via afterCommand
  2. handleClientsBlockedOnKeys: it is out of call() context and it may propagate stuff - via afterCommand.
  3. handleClientsBlockedOnKeys edge case: if the looked-up key is already expired, we will propagate the expire but will not unblock any client so afterCommand isn't called. in that case, we have to propagate the deletion explicitly.
  4. cron stuff: active-expire and eviction may also propagate stuff
  5. modules: the module API allows to propagate stuff from just about anywhere (timers, keyspace notifications, threads). I could have tried to catch all the out-of-call-context places but it seemed easier to handle it in one place: when we free the context. in the spirit of what was done in call(), only the top-most freeing of a module context may cause propagation.
  6. modules: when using a thread-safe ctx it's not clear when/if the ctx will be freed. we do know that the module must lock the GIL before calling RM_Replicate/RM_Call so we propagate the pending commands when releasing the GIL.

A "known limitation", which were actually a bug, was fixed because of this commit (see propagate.tcl):
When using a mix of RM_Call with ! and RM_Replicate, the command would propagate out-of-order: first all the commands from RM_Call, and then the ones from RM_Replicate

Another thing worth mentioning is that if, in the past, a client would issue a MULTI/EXEC with just one write command the server would blindly propagate the MULTI/EXEC too, even though it's redundant. not anymore.

This commit renames propagate() to propagateNow() in order to cause conflicts in pending PRs.
propagatePendingCommands is the only caller of propagateNow, which is now a static, internal helper function.

Optimizations:

  1. alsoPropagate will not add stuff to also_propagate if there's no AOF and replicas
  2. alsoPropagate reallocs also_propagagte exponentially, to save calls to memmove

Bugfixes:

  1. CONFIG SET can create evictions, sending notifications which can cause to dirty++ with modules.
    we need to prevent it from propagating to AOF/replicas
  2. We need to set current_client in RM_Call. buggy scenario:
    • CONFIG SET maxmemory, eviction notifications, module hook calls RM_Call
    • assertion in lookupKey crashes, because current_client has CONFIG SET, which isn't CMD_WRITE
  3. minor: in eviction, call propagateDeletion after notification, like active-expire and all commands
    (we always send a notification before propagating the command)

@guybe7 guybe7 requested a review from oranagra December 3, 2021 12:52
src/multi.c Show resolved Hide resolved
@guybe7 guybe7 removed the request for review from oranagra December 3, 2021 13:08
@guybe7 guybe7 marked this pull request as draft December 3, 2021 13:08
@guybe7 guybe7 marked this pull request as ready for review December 3, 2021 14:29
@guybe7 guybe7 requested a review from oranagra December 3, 2021 14:30
@zuiderkwast
Copy link
Contributor

"Sort out mess" 👍 😁

@oranagra
Copy link
Member

oranagra commented Dec 5, 2021

@guybe7 thanks..

Few requests while i review:

  1. please start off the top comment to describe what was that mess and what complications it caused. i.e. some parts using "also" late propagation, others using an immediate one, causing edge cases, ugly / tacky code, and tendency for bugs?.
  2. please look into the sanitizer timeout failure in LATENCY of expire events are correctly collected, i don't recall seeing it before.

@oranagra oranagra added this to Backlog in 7.0 via automation Dec 5, 2021
@oranagra oranagra moved this from Backlog to In Review in 7.0 Dec 5, 2021
@guybe7
Copy link
Collaborator Author

guybe7 commented Dec 5, 2021

@oranagra i can't figure it out, i ran it locally (build with SANITIZER=address and ran the tests) and it passed

@oranagra
Copy link
Member

oranagra commented Dec 5, 2021

I've re triggered the tests and got the same outcome.
Maybe it has something to do with a timing issue issue causing a hung, or maybe related to the gcc version being used?

src/aof.c Outdated Show resolved Hide resolved
src/blocked.c Show resolved Hide resolved
src/blocked.c Outdated Show resolved Hide resolved
src/evict.c Outdated Show resolved Hide resolved
src/module.c Outdated Show resolved Hide resolved
tests/unit/moduleapi/propagate.tcl Show resolved Hide resolved
tests/unit/moduleapi/propagate.tcl Show resolved Hide resolved
tests/unit/moduleapi/propagate.tcl Show resolved Hide resolved
tests/unit/moduleapi/propagate.tcl Outdated Show resolved Hide resolved
tests/unit/multi.tcl Show resolved Hide resolved
@guybe7 guybe7 force-pushed the prop_fix branch 2 times, most recently from 82d8f81 to 07f8a39 Compare December 8, 2021 14:22
src/blocked.c Show resolved Hide resolved
src/server.h Outdated Show resolved Hide resolved
src/blocked.c Outdated Show resolved Hide resolved
src/module.c Outdated Show resolved Hide resolved
src/server.c Show resolved Hide resolved
src/server.c Show resolved Hide resolved
src/server.c Outdated Show resolved Hide resolved
src/server.c Show resolved Hide resolved
src/expire.c Show resolved Hide resolved
@oranagra
Copy link
Member

oranagra commented Dec 9, 2021

@soloestoy i'll be happy if you can have a look and see if you can spot any issues with the new approach.

@soloestoy
Copy link
Collaborator

@soloestoy i'll be happy if you can have a look and see if you can spot any issues with the new approach.

OK, I'll check in a day or two, too busy this week...

bugfixes:
1. CONFIG SET can create evictions, sending notifications which can cause to dirty++ with modules.
   we need to prevent it from propagating to AOF/replicas
2. we need to set current_client in RM_Call. buggy scenario:
   - CONFIG SET maxmemory, notifications, module hook calls RM_Call
   - assertion in lookupKey crashes, because current_client has CONFIG SET, which isn't CMD_WRITE
3. minor: in eviction, call propagateDeletion after notification, like active-expire and all commands
   (we always send a notification before propagating the command)
panjf2000 pushed a commit to panjf2000/redis that referenced this pull request Feb 3, 2022
This would mean that the effects of `CONFIG SET maxmemory` may not be visible once the command returns.
That could anyway happen since incremental eviction was added in redis 6.2 (see redis#7653)

We do this to fix one of the propagation bugs about eviction see redis#9890 and redis#10014.
MeirShpilraien added a commit to RedisGears/RedisGears that referenced this pull request Feb 13, 2022
This PR is for future compatability to Redis 7.
Till Redis 7, if commands would have executed from
background thread, then Redis would not have wrap
them with multi exec and the commands would not
have executed atomically on replica. Gears solved
it by propagating multi exec by itself.

Thanks to this PR: redis/redis#9890
It is no longer needed to replicate multi exec,
Redis takes care of this for us.
oranagra added a commit to oranagra/redis that referenced this pull request Apr 12, 2022
RM_Yield was missing a call to protectClient to prevent redis from
processing future commands of the yielding client.

Adding tests that fail without this fix.

This would be complicated to solve since nested calls to RM_Call used to
replace the current_client variable with the module temp client.

It looks like it's no longer necessary to do that, since it was added
back in redis#9890 to solve two issues, both already gone:
1. call to CONFIG SET maxmemory could trigger a module hook calling
   RM_Call. although this specific issue is gone, arguably other hooks
   like keyspace notification, can do the same.
2. an assertion in lookupKey that checks the current command of the
   current client, introduced in redis#9572 and removed in redis#10248
oranagra added a commit that referenced this pull request Apr 18, 2022
…10573)

RM_Yield was missing a call to protectClient to prevent redis from
processing future commands of the yielding client.

Adding tests that fail without this fix.

This would be complicated to solve since nested calls to RM_Call used to
replace the current_client variable with the module temp client.

It looks like it's no longer necessary to do that, since it was added
back in #9890 to solve two issues, both already gone:
1. call to CONFIG SET maxmemory could trigger a module hook calling
   RM_Call. although this specific issue is gone, arguably other hooks
   like keyspace notification, can do the same.
2. an assertion in lookupKey that checks the current command of the
   current client, introduced in #9572 and removed in #10248
guybe7 added a commit to guybe7/redis that referenced this pull request Apr 20, 2022
If was first added in redis#9890 to solve the problem of
CONFIG SET maxmemory causing eviction inside MULTI/EXEC,
but that problem is already fixed (CONFIG SET doesn't evict
directly, it just schedules a later eviction)

Keep that condition may hide bugs (i.e. performEvictions
should always expect to have an empty server.also_propagate)
oranagra pushed a commit that referenced this pull request Apr 24, 2022
If was first added in #9890 to solve the problem of
CONFIG SET maxmemory causing eviction inside MULTI/EXEC,
but that problem is already fixed (CONFIG SET doesn't evict
directly, it just schedules a later eviction)

Keep that condition may hide bugs (i.e. performEvictions
should always expect to have an empty server.also_propagate)
oranagra pushed a commit that referenced this pull request Apr 25, 2022
This case is interesting because it originates from cron,
rather than from another command.

The idea came from looking at #9890 and #10573, and I was wondering if RM_Call
would work properly when `server.current_client == NULL`
oranagra added a commit to oranagra/redis that referenced this pull request Aug 23, 2022
Redis 7.0 has redis#9890 which added an assertion when the propagation queue
was not flushed and we got to beforeSleep.
But it turns out that when processCommands calls getNodeByQuery and
decides to reject the command, it can lead to a key that was lazy
expired and is deleted without later flushing the propagation queue.

This change prevents lazy expiry from deleting the key at this stage
(not as part of a command being processed in `call`)

Closes redis#11014

getNodeByQuery
oranagra added a commit to oranagra/redis that referenced this pull request Aug 23, 2022
Redis 7.0 has redis#9890 which added an assertion when the propagation queue
was not flushed and we got to beforeSleep.
But it turns out that when processCommands calls getNodeByQuery and
decides to reject the command, it can lead to a key that was lazy
expired and is deleted without later flushing the propagation queue.

This change prevents lazy expiry from deleting the key at this stage
(not as part of a command being processed in `call`)

Closes redis#11014

Co-authored-by: Madelyn Olson <34459052+madolson@users.noreply.github.com>
oranagra added a commit that referenced this pull request Aug 24, 2022
…#11176)

Redis 7.0 has #9890 which added an assertion when the propagation queue
was not flushed and we got to beforeSleep.
But it turns out that when processCommands calls getNodeByQuery and
decides to reject the command, it can lead to a key that was lazy
expired and is deleted without later flushing the propagation queue.

This change prevents lazy expiry from deleting the key at this stage
(not as part of a command being processed in `call`)
oranagra added a commit to oranagra/redis that referenced this pull request Sep 21, 2022
…redis#11176)

Redis 7.0 has redis#9890 which added an assertion when the propagation queue
was not flushed and we got to beforeSleep.
But it turns out that when processCommands calls getNodeByQuery and
decides to reject the command, it can lead to a key that was lazy
expired and is deleted without later flushing the propagation queue.

This change prevents lazy expiry from deleting the key at this stage
(not as part of a command being processed in `call`)

(cherry picked from commit c789fb0)
oranagra added a commit that referenced this pull request Sep 21, 2022
…#11176)

Redis 7.0 has #9890 which added an assertion when the propagation queue
was not flushed and we got to beforeSleep.
But it turns out that when processCommands calls getNodeByQuery and
decides to reject the command, it can lead to a key that was lazy
expired and is deleted without later flushing the propagation queue.

This change prevents lazy expiry from deleting the key at this stage
(not as part of a command being processed in `call`)

(cherry picked from commit c789fb0)
oranagra pushed a commit that referenced this pull request Feb 14, 2023
Starting from Redis 7.0 (#9890) we started wrapping everything a command
 propagates with MULTI/EXEC. The problem is that both SCAN and RANDOMKEY can
lazy-expire arbitrary keys (similar behavior to active-expire), and put DELs in a transaction.

Fix: When these commands are called without a parent exec-unit (e.g. not in EVAL or
MULTI) we avoid wrapping their DELs in a transaction (for the same reasons active-expire
and eviction avoids a transaction)

This PR adds a per-command flag that indicates that the command may touch arbitrary
keys (not the ones in the arguments), and uses that flag to avoid the MULTI-EXEC.
For now, this flag is internal, since we're considering other solutions for the future.

Note for cluster mode: if SCAN/RANDOMKEY is inside EVAL/MULTI it can still cause the
same situation (as it always did), but it won't cause a CROSSSLOT because replicas and AOF
do not perform slot checks.
The problem with the above is mainly for 3rd party ecosystem tools that propagate commands
from master to master, or feed an AOF file with redis-cli into a master.
This PR aims to fix the regression in redis 7.0, and we opened #11792 to try to handle the
bigger problem with lazy expire better for another release.
oranagra pushed a commit to oranagra/redis that referenced this pull request Feb 27, 2023
Starting from Redis 7.0 (redis#9890) we started wrapping everything a command
 propagates with MULTI/EXEC. The problem is that both SCAN and RANDOMKEY can
lazy-expire arbitrary keys (similar behavior to active-expire), and put DELs in a transaction.

Fix: When these commands are called without a parent exec-unit (e.g. not in EVAL or
MULTI) we avoid wrapping their DELs in a transaction (for the same reasons active-expire
and eviction avoids a transaction)

This PR adds a per-command flag that indicates that the command may touch arbitrary
keys (not the ones in the arguments), and uses that flag to avoid the MULTI-EXEC.
For now, this flag is internal, since we're considering other solutions for the future.

Note for cluster mode: if SCAN/RANDOMKEY is inside EVAL/MULTI it can still cause the
same situation (as it always did), but it won't cause a CROSSSLOT because replicas and AOF
do not perform slot checks.
The problem with the above is mainly for 3rd party ecosystem tools that propagate commands
from master to master, or feed an AOF file with redis-cli into a master.
This PR aims to fix the regression in redis 7.0, and we opened redis#11792 to try to handle the
bigger problem with lazy expire better for another release.

(cherry picked from commit fd82bcc)
oranagra pushed a commit that referenced this pull request Feb 28, 2023
Starting from Redis 7.0 (#9890) we started wrapping everything a command
 propagates with MULTI/EXEC. The problem is that both SCAN and RANDOMKEY can
lazy-expire arbitrary keys (similar behavior to active-expire), and put DELs in a transaction.

Fix: When these commands are called without a parent exec-unit (e.g. not in EVAL or
MULTI) we avoid wrapping their DELs in a transaction (for the same reasons active-expire
and eviction avoids a transaction)

This PR adds a per-command flag that indicates that the command may touch arbitrary
keys (not the ones in the arguments), and uses that flag to avoid the MULTI-EXEC.
For now, this flag is internal, since we're considering other solutions for the future.

Note for cluster mode: if SCAN/RANDOMKEY is inside EVAL/MULTI it can still cause the
same situation (as it always did), but it won't cause a CROSSSLOT because replicas and AOF
do not perform slot checks.
The problem with the above is mainly for 3rd party ecosystem tools that propagate commands
from master to master, or feed an AOF file with redis-cli into a master.
This PR aims to fix the regression in redis 7.0, and we opened #11792 to try to handle the
bigger problem with lazy expire better for another release.

(cherry picked from commit fd82bcc)
Mixficsol pushed a commit to Mixficsol/redis that referenced this pull request Apr 12, 2023
…redis#11176)

Redis 7.0 has redis#9890 which added an assertion when the propagation queue
was not flushed and we got to beforeSleep.
But it turns out that when processCommands calls getNodeByQuery and
decides to reject the command, it can lead to a key that was lazy
expired and is deleted without later flushing the propagation queue.

This change prevents lazy expiry from deleting the key at this stage
(not as part of a command being processed in `call`)
Mixficsol pushed a commit to Mixficsol/redis that referenced this pull request Apr 12, 2023
Starting from Redis 7.0 (redis#9890) we started wrapping everything a command
 propagates with MULTI/EXEC. The problem is that both SCAN and RANDOMKEY can
lazy-expire arbitrary keys (similar behavior to active-expire), and put DELs in a transaction.

Fix: When these commands are called without a parent exec-unit (e.g. not in EVAL or
MULTI) we avoid wrapping their DELs in a transaction (for the same reasons active-expire
and eviction avoids a transaction)

This PR adds a per-command flag that indicates that the command may touch arbitrary
keys (not the ones in the arguments), and uses that flag to avoid the MULTI-EXEC.
For now, this flag is internal, since we're considering other solutions for the future.

Note for cluster mode: if SCAN/RANDOMKEY is inside EVAL/MULTI it can still cause the
same situation (as it always did), but it won't cause a CROSSSLOT because replicas and AOF
do not perform slot checks.
The problem with the above is mainly for 3rd party ecosystem tools that propagate commands
from master to master, or feed an AOF file with redis-cli into a master.
This PR aims to fix the regression in redis 7.0, and we opened redis#11792 to try to handle the
bigger problem with lazy expire better for another release.
madolson pushed a commit to madolson/redis that referenced this pull request Apr 19, 2023
…redis#11176)

Redis 7.0 has redis#9890 which added an assertion when the propagation queue
was not flushed and we got to beforeSleep.
But it turns out that when processCommands calls getNodeByQuery and
decides to reject the command, it can lead to a key that was lazy
expired and is deleted without later flushing the propagation queue.

This change prevents lazy expiry from deleting the key at this stage
(not as part of a command being processed in `call`)
enjoy-binbin pushed a commit to enjoy-binbin/redis that referenced this pull request Jul 31, 2023
…edis#10573)

RM_Yield was missing a call to protectClient to prevent redis from
processing future commands of the yielding client.

Adding tests that fail without this fix.

This would be complicated to solve since nested calls to RM_Call used to
replace the current_client variable with the module temp client.

It looks like it's no longer necessary to do that, since it was added
back in redis#9890 to solve two issues, both already gone:
1. call to CONFIG SET maxmemory could trigger a module hook calling
   RM_Call. although this specific issue is gone, arguably other hooks
   like keyspace notification, can do the same.
2. an assertion in lookupKey that checks the current command of the
   current client, introduced in redis#9572 and removed in redis#10248
enjoy-binbin pushed a commit to enjoy-binbin/redis that referenced this pull request Jul 31, 2023
If was first added in redis#9890 to solve the problem of
CONFIG SET maxmemory causing eviction inside MULTI/EXEC,
but that problem is already fixed (CONFIG SET doesn't evict
directly, it just schedules a later eviction)

Keep that condition may hide bugs (i.e. performEvictions
should always expect to have an empty server.also_propagate)
enjoy-binbin pushed a commit to enjoy-binbin/redis that referenced this pull request Jul 31, 2023
This case is interesting because it originates from cron,
rather than from another command.

The idea came from looking at redis#9890 and redis#10573, and I was wondering if RM_Call
would work properly when `server.current_client == NULL`
enjoy-binbin pushed a commit to enjoy-binbin/redis that referenced this pull request Jul 31, 2023
…redis#11176)

Redis 7.0 has redis#9890 which added an assertion when the propagation queue
was not flushed and we got to beforeSleep.
But it turns out that when processCommands calls getNodeByQuery and
decides to reject the command, it can lead to a key that was lazy
expired and is deleted without later flushing the propagation queue.

This change prevents lazy expiry from deleting the key at this stage
(not as part of a command being processed in `call`)
enjoy-binbin pushed a commit to enjoy-binbin/redis that referenced this pull request Jul 31, 2023
Starting from Redis 7.0 (redis#9890) we started wrapping everything a command
 propagates with MULTI/EXEC. The problem is that both SCAN and RANDOMKEY can
lazy-expire arbitrary keys (similar behavior to active-expire), and put DELs in a transaction.

Fix: When these commands are called without a parent exec-unit (e.g. not in EVAL or
MULTI) we avoid wrapping their DELs in a transaction (for the same reasons active-expire
and eviction avoids a transaction)

This PR adds a per-command flag that indicates that the command may touch arbitrary
keys (not the ones in the arguments), and uses that flag to avoid the MULTI-EXEC.
For now, this flag is internal, since we're considering other solutions for the future.

Note for cluster mode: if SCAN/RANDOMKEY is inside EVAL/MULTI it can still cause the
same situation (as it always did), but it won't cause a CROSSSLOT because replicas and AOF
do not perform slot checks.
The problem with the above is mainly for 3rd party ecosystem tools that propagate commands
from master to master, or feed an AOF file with redis-cli into a master.
This PR aims to fix the regression in redis 7.0, and we opened redis#11792 to try to handle the
bigger problem with lazy expire better for another release.
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
approval-needed Waiting for core team approval to be merged release-notes indication that this issue needs to be mentioned in the release notes state:major-decision Requires core team consensus
Projects
Archived in project
7.0
Done
Development

Successfully merging this pull request may close these issues.

None yet

7 participants