
[FEAT] Support exponential backoff in the wrapped BackoffAwareScaler type #253

Open
brooksmtownsend opened this issue Apr 4, 2024 · 0 comments
Labels
enhancement New feature or request

Comments

@brooksmtownsend
Member

Inside the scaler logic we have the notion of a BackoffAwareScaler, which approaches exponential backoff in a very naïve way. For specific commands that might take a long time (ProviderStarted, ComponentScaled), we prevent that scaler from sending out commands either for 30 seconds or until we receive an event that is specifically in response to that scaler. For a provider start command, for example, we'd expect that provider to either start or fail to start, with a corresponding event.

The real problem we're trying to solve here is preventing a scaler from thrashing in response to events that might be relevant. What isn't solved here is the more generic problem of thrashing in response to events that are relevant. Imagine a scaler attempting to start a Wasm component that lives in a private registry, while the wasmCloud host does not have credentials. The scaler publishes the command, the host fails to authenticate almost immediately, and a component_scale_failed event is emitted. The scaler sees that the component failed to scale and, being the dumb scaler that it is (it doesn't look at the error type), immediately tries to start it again. Rust is fast, so we'll keep retrying this forever, or until someone notices the increased load.

My proposal is to wrap every scaler in the BackoffAwareScaler structure, keeping a backoff timer for repeated commands that lives outside the individual scaler's logic. We want to make sure that the individual scaler is able to reconcile immediately when state is actually modified, but when it's hopelessly attempting the same command over and over, we can apply an exponential (power of two, Fibonacci, etc.) backoff before sending out that next command.
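
To make the idea concrete, here is a minimal sketch of what such a timer could look like. Everything in it is hypothetical and not existing wadm code: the `RepeatBackoff` name, the `should_send` method, the power-of-two schedule, and the choice to pass `now` in explicitly (so the logic can be tested without sleeping) are all assumptions for illustration. A different command is treated as "state actually changed" and goes out immediately; an identical command is held back until the current delay has elapsed.

```rust
use std::time::{Duration, Instant};

/// Hypothetical helper (not part of wadm): tracks how many times the same
/// command has been sent in a row and when the next identical one is allowed.
#[derive(Debug)]
pub struct RepeatBackoff<C> {
    base: Duration,
    max: Duration,
    last: Option<C>,
    repeats: u32,
    not_before: Option<Instant>,
}

impl<C: PartialEq + Clone> RepeatBackoff<C> {
    pub fn new(base: Duration, max: Duration) -> Self {
        Self { base, max, last: None, repeats: 0, not_before: None }
    }

    /// Delay after `repeats` consecutive identical commands:
    /// base * 2^(repeats - 1), capped at `max`.
    pub fn delay_for(&self, repeats: u32) -> Duration {
        let factor = 2u32.saturating_pow(repeats.saturating_sub(1));
        self.base.saturating_mul(factor).min(self.max)
    }

    /// Returns true if `cmd` may be sent at `now`, and records the attempt.
    pub fn should_send(&mut self, cmd: &C, now: Instant) -> bool {
        if self.last.as_ref() != Some(cmd) {
            // A different command means the desired state changed:
            // send right away and restart the schedule.
            self.last = Some(cmd.clone());
            self.repeats = 1;
            self.not_before = Some(now + self.delay_for(1));
            return true;
        }
        match self.not_before {
            // Same command again, still inside the backoff window: hold it.
            Some(deadline) if now < deadline => false,
            // Window elapsed: allow the retry and double the next delay.
            _ => {
                self.repeats = self.repeats.saturating_add(1);
                self.not_before = Some(now + self.delay_for(self.repeats));
                true
            }
        }
    }

    /// Clears all state, e.g. once the command is known to have succeeded.
    pub fn reset(&mut self) {
        self.last = None;
        self.repeats = 0;
        self.not_before = None;
    }
}
```

With a base of 1 second and a cap of 8 seconds, repeated identical commands would be allowed at roughly 0s, 1s, 3s, 7s, 15s, and then every 8 seconds, while any different command passes through immediately. Whether "same command" should mean strict equality, and whether a success event should call `reset`, are open design questions this sketch does not settle.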
