[NEW] Auto-remediation of long running scripts where possible #11215

madolson · 2022-08-30T22:26:08Z

The problem/use-case that the feature addresses

At AWS, we automatically kill scripts after the timeout is exceeded (if it's a read-only script) to limit the impact of infinite scripts. We imagine that other users of Redis do the same thing, since few systems have built in remediation for a long running lua script. While integrating with Redis 7, we found that there are two separate commands you have to issue to kill an eval script vs a function, and the only way to determine which one to execute is by parsing the error message. There was some discussion about this, #9780 (comment), which I think missed a couple of points which were later discussed in #9655, namely that most applications never really know what to do with -BUSY so we should have a simple flow for deciding what to do with it.

Description of the feature

I would like to propose a config flag that automatically kills long running lua scripts that are RO, that way when we don't need to try to ascertain what is going on with the engine. If we get a -BUSY response back, we can simplify the decision making progress and not have to worry about issuing a script kill. Also, we want to automatically return a more descriptive error message so we are not wondering whether or not something is un-killable.

We also want to consider implementing code so that we can perform a coordinated failover when an infinite unkillable script is encountered, so that we can make a best effort to flush pending output buffers to a replica.

Alternatives you've considered

Allow SCRIPT KILL to also kill functions. In a sense this is now a backwards breaking change, hence the new approach.

The text was updated successfully, but these errors were encountered:

oranagra · 2022-08-31T06:10:11Z

@madolson in your experience, is it considerably more common for a read only script to get hung than one that performs writes?
i'm not sure there's a lot of value in that auto-kill mechanism.
it could be nice to have more details in the original BUSY command to indicate if the script is at all killable.
I also see the benefit of a single KILL command that can kill anything, but considering we're talking about software that runs that command, and the fact it can make a destruction based on the error message, i'm not sure the advantage overcomes the possible confusion it could create (too many similar commands with confusing names)

madolson · 2022-08-31T16:40:00Z

@oranagra The data I have says about 20% of all scripts we try to kill were killable. There might be some skew towards killable scripts though, since unkillable scripts require us to kill the process and, may, result in data loss which may prevent the unkillable situation from re-emerging. Killable scripts might just result in the user sending the same script again.

i'm not sure there's a lot of value in that auto-kill mechanism.

I think the main value is that is just simplifies the decision tree. If you get a "BUSY" back there is just one decision, wait to see if it completes or kill the process. It's also a pretty trivial feature to implement, and if we ever get fancy with lua rollback that can also potentially be added to this enum of options.

it could be nice to have more details in the original BUSY command to indicate if the script is at all killable.

Yeah, I think this is good regardless.

I also see the benefit of a single KILL command that can kill anything, but considering we're talking about software that runs that command, and the fact it can make a destruction based on the error message, i'm not sure the advantage overcomes the possible confusion it could create (too many similar commands with confusing names)

That is why I wanted to go with the auto-kill, it removes this type of confusion.

enjoy-binbin · 2023-06-07T10:55:50Z

I once wanted to add a force option to script kill

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

[NEW] Auto-remediation of long running scripts where possible #11215

[NEW] Auto-remediation of long running scripts where possible #11215

madolson commented Aug 30, 2022

oranagra commented Aug 31, 2022

madolson commented Aug 31, 2022 •

edited

Loading

enjoy-binbin commented Jun 7, 2023

[NEW] Auto-remediation of long running scripts where possible #11215

[NEW] Auto-remediation of long running scripts where possible #11215

Comments

madolson commented Aug 30, 2022

oranagra commented Aug 31, 2022

madolson commented Aug 31, 2022 • edited Loading

enjoy-binbin commented Jun 7, 2023

madolson commented Aug 31, 2022 •

edited

Loading