script should not allow may-replicate commands when client pause write #10364

soloestoy · 2022-03-01T09:13:08Z

In some special commands like eval_ro/fcall_ro we allow no-writes commands. But may-replicate commands are no-writes too, that leads crash when client pause write:

=== REDIS BUG REPORT START: Cut & paste starting from here ===
51328:M 01 Mar 2022 17:08:59.273 # === ASSERTION FAILED ===
51328:M 01 Mar 2022 17:08:59.273 # ==> server.c:3036 '!(areClientsPaused() && !server.client_pause_in_transaction)' is not true

------ STACK TRACE ------

Backtrace:
./redis-server *:6789[0x447127]
./redis-server *:6789(propagatePendingCommands+0x72)[0x44ab82]
./redis-server *:6789(afterCommand+0x33)[0x44ae53]
./redis-server *:6789(call+0x31a)[0x44b17a]
./redis-server *:6789(processCommand+0x990)[0x44cf90]
./redis-server *:6789(processCommandAndResetClient+0x1c)[0x462a5c]
./redis-server *:6789(processInputBuffer+0xd1)[0x465541]
./redis-server *:6789(readQueryFromClient+0x238)[0x468558]
./redis-server *:6789[0x4fd373]
./redis-server *:6789(aeProcessEvents+0x235)[0x443735]
./redis-server *:6789(aeMain+0x1d)[0x443a7d]
./redis-server *:6789(main+0x406)[0x43fe16]
/lib64/libc.so.6(__libc_start_main+0xf5)[0x7f509184f445]
./redis-server *:6789[0x4401cd]

...

------ CURRENT CLIENT INFO ------
id=3 addr=127.0.0.1:46466 laddr=127.0.0.1:6789 fd=7 name= age=3 idle=0 flags=N db=0 sub=0 psub=0 multi=-1 qbuf=67 qbuf-free=20407 argv-mem=44 multi-mem=0 rbs=1024 rbp=5 obl=4 oll=0 omem=0 tot-mem=22340 events=r cmd=eval_ro user=default redir=-1 resp=2
argv[0]: '"eval_ro"'
argv[1]: '"return redis.call('publish','ch','msg')"'
argv[2]: '"0"'

src/script.c

oranagra

after applying the suggestions, you'll probably need to update the test too.

src/script.c

oranagra · 2022-03-08T14:58:12Z

@soloestoy @madolson i merged this and figured it doesn't need to be mentioned in the release notes since this problem is new to redis 7.0 (EVAL_RO is new).

But it made me realize that maybe our decision in 6.2 to flag EVAL as may-replicate was not wise, and instead we should have relied on a check like the one added in this PR.
i.e. it means that someone will be able to keep running scripts with solely read-only commands using EVAL (rather than forcing them to use EVAL_RO).
WDYT?

reminding us that the main reason for adding EVAL_RO was for the command flags, so that cluster aware clients can propagate that to replicas too.

soloestoy · 2022-05-12T09:29:33Z

But it made me realize that maybe our decision in 6.2 to flag EVAL as may-replicate was not wise, and instead we should have relied on a check like the one added in this PR.
i.e. it means that someone will be able to keep running scripts with solely read-only commands using EVAL (rather than forcing them to use EVAL_RO).

not sure which way is better, but seems we have different behaviors between EVAL and EVAL_RO when write paused:

eval "redis.call('publish','c','m')" 0 would be blocked but succeed after unpaused.
eval_ro "redis.call('publish','c','m')" 0 would failed.

BTW, transaction with write or may-replicate command performs like eval_ro.

I think we should unify the behavior about EVAL and EVAL_RO (FCALL and FCALL_RO), but not sure which one is better, WDYT @oranagra @madolson

oranagra · 2022-05-12T10:18:39Z

BTW, transaction with write or may-replicate command performs like eval_ro.

yes, but the difference is that it won't fail all transactions, just ones with write / may-replicate commands.
for EVAL, we'll fail any attempt to run a script, even one with solely read-only commands.
but obviously the other difference is that in a transactions we know in advance which commands they use, and for scripts we could fail the script after some commands were already executed (breaking it's atomicity)

maybe that means the fix in this PR is bad and instead we need to consistently block may-replicate commands in EVAL_RO (so that we don't fail the script half way)?
and maybe add shebang flags to control it? i.e. a script that declared a shebang no-replicate will be allowed to run (with both EVAL and EVAL_RO) during client pause..

@MeirShpilraien @yossigo please join this discussion as well.

oranagra · 2022-05-12T15:14:27Z

I've discussed this with Yossi, we propose the following:

consider PFCOUNT and PUBLISH (the may-replicate commands, other than EVAL) as write commands in scripts, this means that they can never be used in scripts with shebang and no no-writes flag, and also not in the _RO command variants.
block them in eval_RO always (even in scripts without shebang)
allow no-write scripts in EVAL (not just in EVAL_RO), even during client pause writes.
i.e. if the script declared its intention ahead of time, we don't risk breaking atomicity by stopping it in the middle, and we consider the intention declaration of the script the same (or stronger) as the intention declaration of the caller (which command is being used)

@madolson please join this discussion, in the context of #8820, i think it means again that the main purpose of EVAL_RO is for client library routing.

madolson · 2022-05-15T02:36:58Z

Sorry I've been under rock.

I strongly agree blocking PFCOUNT and PUBLISH (and all may-replicate) from EVAL_RO scripts as well as "no-write" scripts. I think any mechanism that enforces that is sufficient, but I think blocking them is the best approach from all scripts.
Yes
I'm not entirely sure what this means. Normally* a client pause can't be started midway through the script, since it can't call any of the commands that can start the pause, so I assume you mean starting a no-write EVAL during the pause. The client pause check happens in processCommand(), which is before we've parsed the lua script. There is a bunch of new accounting we have to keep track of since they now act more like BPOP blocking. Seems like a bunch of work when EVAL_RO ostensibly does the same thing.

(There is still the case where in cluster mode a coordinated failover can be triggered during a long running lua script, which periodically yields to the clusterbus, which might trigger a pause part way through. I think we still just want to fix that though)

oranagra · 2022-05-15T06:22:36Z

@madolson what i meant in [3] is that same as EVAL_RO is allowed to run (get executed) while writes are paused, we should be allowing the same for normal EVAL if the script declared the no-writes flag.
it may be a little complicated to achieve (we may need to remove the may-replicate from EVAL, and instead do that check after parsing the shebang flags.
but anyway, putting aside the implementation, the idea is that we don't need to force the user to use a special command to declare what the script does, the script (or function) is the one that declares what it does. i'm sure it'll be easier for users to keep using plain EVAL and have it implicitly working when it can.

oranagra · 2022-05-16T07:37:41Z

we discussed this on a core-team meeting, and our conclusions are:

for scripts, we want to consider the may-replicate commands as write commands (the applies to all concerns around write commands)
block may-replicate commands in EVAL_RO always (even in scripts without shebang)
allow scripts flagged with no-write to be executed with normal EVAL and FCALL during CLIENT PAUSE WRITE
hide the may-replicate flag from COMMAND command output (it's an internal implementation detail we rather not expose).

background:

we don't want to expose a no-may-replicate flag or alike to scripts, since we consider the may-replicate thing an internal concern of redis, that we may some day get rid of.
in fact, the may-replicate flag was initially introduced to flag EVAL: since we didn't know what it's gonna do ahead of execution, before function-flags existed). PUBLISH and PFCOUNT, both of which because they have side effects which may some day be fixed differently.

@guybe7 @MeirShpilraien FYI

oranagra · 2022-05-16T11:31:15Z

allow scripts flagged with no-write to be executed with normal EVAL and FCALL during CLIENT PAUSE WRITE

this is a little bit complicated.
we have two options:

allow the script to start running (call call() and cmd->proc()), and then abort the script half way (in scriptPrepareForRun) and in some way reset the client to be paused as if the command is still pending. this is complicated and there's the possibility for many side effects.
let processCommand look into the script (fetch the script from the dict if needed, or parse it, and run part of scriptPrepareForRun), so that the script flags are available at an early stage. this could have many other benefits in the future.

soloestoy · 2022-05-17T03:24:57Z

let processCommand look into the script (fetch the script from the dict if needed, or parse it, and run part of scriptPrepareForRun), so that the script flags are available at an early stage. this could have many other benefits in the future.

vote for option 2

oranagra · 2022-05-17T05:24:18Z

yes, the obvious choice.. i already started working on it. trying to decide if it should be a very generic mechanism, a dirty hack, or something in between.

oranagra · 2022-05-18T13:36:17Z

#10744 feedback is welcome.

MeirShpilraien reviewed Mar 1, 2022

View reviewed changes

src/script.c Outdated Show resolved Hide resolved

oranagra added the 7.0-must-have label Mar 1, 2022

oranagra approved these changes Mar 1, 2022

View reviewed changes

src/script.c Show resolved Hide resolved

script should not allow may-replicate commands when client pause write

bc308c3

soloestoy force-pushed the fix-may-repl-in-ro-script branch from ad1c13b to bc308c3 Compare March 2, 2022 03:28

oranagra merged commit 728e625 into redis:unstable Mar 8, 2022

oranagra mentioned this pull request Apr 5, 2022

Release 7.0 rc3 #10532

Merged

oranagra mentioned this pull request May 18, 2022

Expose script flags to processCommand for better handling #10744

Merged

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

script should not allow may-replicate commands when client pause write #10364

script should not allow may-replicate commands when client pause write #10364

soloestoy commented Mar 1, 2022

oranagra left a comment

oranagra commented Mar 8, 2022

soloestoy commented May 12, 2022 •

edited

Loading

oranagra commented May 12, 2022

oranagra commented May 12, 2022

madolson commented May 15, 2022

oranagra commented May 15, 2022

oranagra commented May 16, 2022

oranagra commented May 16, 2022

soloestoy commented May 17, 2022

oranagra commented May 17, 2022

oranagra commented May 18, 2022

script should not allow may-replicate commands when client pause write #10364

script should not allow may-replicate commands when client pause write #10364

Conversation

soloestoy commented Mar 1, 2022

oranagra left a comment

Choose a reason for hiding this comment

oranagra commented Mar 8, 2022

soloestoy commented May 12, 2022 • edited Loading

oranagra commented May 12, 2022

oranagra commented May 12, 2022

madolson commented May 15, 2022

oranagra commented May 15, 2022

oranagra commented May 16, 2022

oranagra commented May 16, 2022

soloestoy commented May 17, 2022

oranagra commented May 17, 2022

oranagra commented May 18, 2022

soloestoy commented May 12, 2022 •

edited

Loading