Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

[NEW] Auto-remediation of long running scripts where possible #11215

Open
madolson opened this issue Aug 30, 2022 · 3 comments
Open

[NEW] Auto-remediation of long running scripts where possible #11215

madolson opened this issue Aug 30, 2022 · 3 comments

Comments

@madolson
Copy link
Contributor

The problem/use-case that the feature addresses

At AWS, we automatically kill scripts after the timeout is exceeded (if it's a read-only script) to limit the impact of infinite scripts. We imagine that other users of Redis do the same thing, since few systems have built in remediation for a long running lua script. While integrating with Redis 7, we found that there are two separate commands you have to issue to kill an eval script vs a function, and the only way to determine which one to execute is by parsing the error message. There was some discussion about this, #9780 (comment), which I think missed a couple of points which were later discussed in #9655, namely that most applications never really know what to do with -BUSY so we should have a simple flow for deciding what to do with it.

Description of the feature

I would like to propose a config flag that automatically kills long running lua scripts that are RO, that way when we don't need to try to ascertain what is going on with the engine. If we get a -BUSY response back, we can simplify the decision making progress and not have to worry about issuing a script kill. Also, we want to automatically return a more descriptive error message so we are not wondering whether or not something is un-killable.

We also want to consider implementing code so that we can perform a coordinated failover when an infinite unkillable script is encountered, so that we can make a best effort to flush pending output buffers to a replica.

Alternatives you've considered

Allow SCRIPT KILL to also kill functions. In a sense this is now a backwards breaking change, hence the new approach.

@oranagra
Copy link
Member

@madolson in your experience, is it considerably more common for a read only script to get hung than one that performs writes?
i'm not sure there's a lot of value in that auto-kill mechanism.
it could be nice to have more details in the original BUSY command to indicate if the script is at all killable.
I also see the benefit of a single KILL command that can kill anything, but considering we're talking about software that runs that command, and the fact it can make a destruction based on the error message, i'm not sure the advantage overcomes the possible confusion it could create (too many similar commands with confusing names)

@madolson
Copy link
Contributor Author

madolson commented Aug 31, 2022

@oranagra The data I have says about 20% of all scripts we try to kill were killable. There might be some skew towards killable scripts though, since unkillable scripts require us to kill the process and, may, result in data loss which may prevent the unkillable situation from re-emerging. Killable scripts might just result in the user sending the same script again.

i'm not sure there's a lot of value in that auto-kill mechanism.

I think the main value is that is just simplifies the decision tree. If you get a "BUSY" back there is just one decision, wait to see if it completes or kill the process. It's also a pretty trivial feature to implement, and if we ever get fancy with lua rollback that can also potentially be added to this enum of options.

it could be nice to have more details in the original BUSY command to indicate if the script is at all killable.

Yeah, I think this is good regardless.

I also see the benefit of a single KILL command that can kill anything, but considering we're talking about software that runs that command, and the fact it can make a destruction based on the error message, i'm not sure the advantage overcomes the possible confusion it could create (too many similar commands with confusing names)

That is why I wanted to go with the auto-kill, it removes this type of confusion.

@enjoy-binbin
Copy link
Collaborator

I once wanted to add a force option to script kill

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
Status: Backlog
Development

No branches or pull requests

3 participants