Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

script should not allow may-replicate commands when client pause write #10364

Merged
merged 1 commit into from
Mar 8, 2022

Conversation

soloestoy
Copy link
Collaborator

In some special commands like eval_ro/fcall_ro we allow no-writes commands. But may-replicate commands are no-writes too, that leads crash when client pause write:

=== REDIS BUG REPORT START: Cut & paste starting from here ===
51328:M 01 Mar 2022 17:08:59.273 # === ASSERTION FAILED ===
51328:M 01 Mar 2022 17:08:59.273 # ==> server.c:3036 '!(areClientsPaused() && !server.client_pause_in_transaction)' is not true

------ STACK TRACE ------

Backtrace:
./redis-server *:6789[0x447127]
./redis-server *:6789(propagatePendingCommands+0x72)[0x44ab82]
./redis-server *:6789(afterCommand+0x33)[0x44ae53]
./redis-server *:6789(call+0x31a)[0x44b17a]
./redis-server *:6789(processCommand+0x990)[0x44cf90]
./redis-server *:6789(processCommandAndResetClient+0x1c)[0x462a5c]
./redis-server *:6789(processInputBuffer+0xd1)[0x465541]
./redis-server *:6789(readQueryFromClient+0x238)[0x468558]
./redis-server *:6789[0x4fd373]
./redis-server *:6789(aeProcessEvents+0x235)[0x443735]
./redis-server *:6789(aeMain+0x1d)[0x443a7d]
./redis-server *:6789(main+0x406)[0x43fe16]
/lib64/libc.so.6(__libc_start_main+0xf5)[0x7f509184f445]
./redis-server *:6789[0x4401cd]

...

------ CURRENT CLIENT INFO ------
id=3 addr=127.0.0.1:46466 laddr=127.0.0.1:6789 fd=7 name= age=3 idle=0 flags=N db=0 sub=0 psub=0 multi=-1 qbuf=67 qbuf-free=20407 argv-mem=44 multi-mem=0 rbs=1024 rbp=5 obl=4 oll=0 omem=0 tot-mem=22340 events=r cmd=eval_ro user=default redir=-1 resp=2
argv[0]: '"eval_ro"'
argv[1]: '"return redis.call('publish','ch','msg')"'
argv[2]: '"0"'

src/script.c Outdated Show resolved Hide resolved
Copy link
Member

@oranagra oranagra left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

after applying the suggestions, you'll probably need to update the test too.

src/script.c Show resolved Hide resolved
@oranagra oranagra merged commit 728e625 into redis:unstable Mar 8, 2022
@oranagra
Copy link
Member

oranagra commented Mar 8, 2022

@soloestoy @madolson i merged this and figured it doesn't need to be mentioned in the release notes since this problem is new to redis 7.0 (EVAL_RO is new).

But it made me realize that maybe our decision in 6.2 to flag EVAL as may-replicate was not wise, and instead we should have relied on a check like the one added in this PR.
i.e. it means that someone will be able to keep running scripts with solely read-only commands using EVAL (rather than forcing them to use EVAL_RO).
WDYT?

reminding us that the main reason for adding EVAL_RO was for the command flags, so that cluster aware clients can propagate that to replicas too.

@oranagra oranagra mentioned this pull request Apr 5, 2022
@soloestoy
Copy link
Collaborator Author

soloestoy commented May 12, 2022

But it made me realize that maybe our decision in 6.2 to flag EVAL as may-replicate was not wise, and instead we should have relied on a check like the one added in this PR.
i.e. it means that someone will be able to keep running scripts with solely read-only commands using EVAL (rather than forcing them to use EVAL_RO).

not sure which way is better, but seems we have different behaviors between EVAL and EVAL_RO when write paused:

  1. eval "redis.call('publish','c','m')" 0 would be blocked but succeed after unpaused.
  2. eval_ro "redis.call('publish','c','m')" 0 would failed.

BTW, transaction with write or may-replicate command performs like eval_ro.

I think we should unify the behavior about EVAL and EVAL_RO (FCALL and FCALL_RO), but not sure which one is better, WDYT @oranagra @madolson

@oranagra
Copy link
Member

BTW, transaction with write or may-replicate command performs like eval_ro.

yes, but the difference is that it won't fail all transactions, just ones with write / may-replicate commands.
for EVAL, we'll fail any attempt to run a script, even one with solely read-only commands.
but obviously the other difference is that in a transactions we know in advance which commands they use, and for scripts we could fail the script after some commands were already executed (breaking it's atomicity)

maybe that means the fix in this PR is bad and instead we need to consistently block may-replicate commands in EVAL_RO (so that we don't fail the script half way)?
and maybe add shebang flags to control it? i.e. a script that declared a shebang no-replicate will be allowed to run (with both EVAL and EVAL_RO) during client pause..

@MeirShpilraien @yossigo please join this discussion as well.

@oranagra
Copy link
Member

I've discussed this with Yossi, we propose the following:

  1. consider PFCOUNT and PUBLISH (the may-replicate commands, other than EVAL) as write commands in scripts, this means that they can never be used in scripts with shebang and no no-writes flag, and also not in the _RO command variants.
  2. block them in eval_RO always (even in scripts without shebang)
  3. allow no-write scripts in EVAL (not just in EVAL_RO), even during client pause writes.
    i.e. if the script declared its intention ahead of time, we don't risk breaking atomicity by stopping it in the middle, and we consider the intention declaration of the script the same (or stronger) as the intention declaration of the caller (which command is being used)

@madolson please join this discussion, in the context of #8820, i think it means again that the main purpose of EVAL_RO is for client library routing.

@madolson
Copy link
Contributor

Sorry I've been under rock.

  1. I strongly agree blocking PFCOUNT and PUBLISH (and all may-replicate) from EVAL_RO scripts as well as "no-write" scripts. I think any mechanism that enforces that is sufficient, but I think blocking them is the best approach from all scripts.
  2. Yes
  3. I'm not entirely sure what this means. Normally* a client pause can't be started midway through the script, since it can't call any of the commands that can start the pause, so I assume you mean starting a no-write EVAL during the pause. The client pause check happens in processCommand(), which is before we've parsed the lua script. There is a bunch of new accounting we have to keep track of since they now act more like BPOP blocking. Seems like a bunch of work when EVAL_RO ostensibly does the same thing.
  • (There is still the case where in cluster mode a coordinated failover can be triggered during a long running lua script, which periodically yields to the clusterbus, which might trigger a pause part way through. I think we still just want to fix that though)

@oranagra
Copy link
Member

@madolson what i meant in [3] is that same as EVAL_RO is allowed to run (get executed) while writes are paused, we should be allowing the same for normal EVAL if the script declared the no-writes flag.
it may be a little complicated to achieve (we may need to remove the may-replicate from EVAL, and instead do that check after parsing the shebang flags.
but anyway, putting aside the implementation, the idea is that we don't need to force the user to use a special command to declare what the script does, the script (or function) is the one that declares what it does. i'm sure it'll be easier for users to keep using plain EVAL and have it implicitly working when it can.

@oranagra
Copy link
Member

we discussed this on a core-team meeting, and our conclusions are:

  • for scripts, we want to consider the may-replicate commands as write commands (the applies to all concerns around write commands)
  • block may-replicate commands in EVAL_RO always (even in scripts without shebang)
  • allow scripts flagged with no-write to be executed with normal EVAL and FCALL during CLIENT PAUSE WRITE
  • hide the may-replicate flag from COMMAND command output (it's an internal implementation detail we rather not expose).

background:

  • we don't want to expose a no-may-replicate flag or alike to scripts, since we consider the may-replicate thing an internal concern of redis, that we may some day get rid of.
  • in fact, the may-replicate flag was initially introduced to flag EVAL: since we didn't know what it's gonna do ahead of execution, before function-flags existed). PUBLISH and PFCOUNT, both of which because they have side effects which may some day be fixed differently.

@guybe7 @MeirShpilraien FYI

@oranagra
Copy link
Member

allow scripts flagged with no-write to be executed with normal EVAL and FCALL during CLIENT PAUSE WRITE

this is a little bit complicated.
we have two options:

  1. allow the script to start running (call call() and cmd->proc()), and then abort the script half way (in scriptPrepareForRun) and in some way reset the client to be paused as if the command is still pending. this is complicated and there's the possibility for many side effects.
  2. let processCommand look into the script (fetch the script from the dict if needed, or parse it, and run part of scriptPrepareForRun), so that the script flags are available at an early stage. this could have many other benefits in the future.

@soloestoy
Copy link
Collaborator Author

  1. let processCommand look into the script (fetch the script from the dict if needed, or parse it, and run part of scriptPrepareForRun), so that the script flags are available at an early stage. this could have many other benefits in the future.

vote for option 2

@oranagra
Copy link
Member

yes, the obvious choice.. i already started working on it. trying to decide if it should be a very generic mechanism, a dirty hack, or something in between.

@oranagra
Copy link
Member

#10744 feedback is welcome.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
Archived in project
Development

Successfully merging this pull request may close these issues.

4 participants