Inside the scaler logic we have the notion of a BackoffAwareScaler, which approaches exponential backoff in a very naïve way. For specific commands that might take a long time (ProviderStarted, ComponentScaled), we prevent that scaler from sending out commands either for 30 seconds or until we receive an event that is specifically in response to that scaler. For a provider start command, for example, we'd expect the provider to either start or fail to start, with a corresponding event.
The real problem we're trying to solve there is preventing a scaler from thrashing in response to events that might be relevant. What it doesn't solve is the more general problem of thrashing in response to events that are relevant. Imagine a scaler attempting to start a Wasm component that lives in a private registry for which the wasmCloud host has no credentials. The scaler publishes the command, the host fails to authenticate almost immediately, and a component_scale_failed event is emitted. The scaler sees that the component failed to scale and, because it doesn't look at the error type, immediately tries to start it again. Rust is fast, so we'll keep retrying forever, or until someone notices the increased load.
My proposal is to wrap every scaler in the BackoffAware structure, so that the wrapper keeps its own backoff timer for repeated commands, external to the scaler logic. The individual scaler should still be able to reconcile immediately when state has actually been modified, but when it's hopelessly attempting the same command over and over, we can apply an exponential (power of two, Fibonacci, etc.) backoff before sending out the next command.
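To make the proposal a bit more concrete, here is a rough sketch of the bookkeeping such a wrapper could do. Everything in it is hypothetical and not taken from the existing code: the `CommandBackoff` type, its methods, and the idea of identifying a command by a string fingerprint are all assumptions for illustration. It uses a power-of-two delay capped at a maximum, and takes the current time as a parameter so the behavior is easy to test.

```rust
use std::time::{Duration, Instant};

/// Hypothetical tracker that a wrapper like BackoffAwareScaler could own.
/// It lives outside the scaler logic and only decides whether a command
/// may be sent right now.
#[derive(Debug)]
pub struct CommandBackoff {
    base: Duration,
    max: Duration,
    /// Fingerprint of the last command sent (assumed to be a string here).
    last_command: Option<String>,
    /// How many times in a row the same command has been sent.
    repeats: u32,
    /// Earliest time the same command may be sent again.
    not_before: Option<Instant>,
}

impl CommandBackoff {
    pub fn new(base: Duration, max: Duration) -> Self {
        Self { base, max, last_command: None, repeats: 0, not_before: None }
    }

    /// Delay imposed after the `repeats`-th consecutive identical send:
    /// base * 2^(repeats - 1), capped at `max`.
    pub fn delay_for(&self, repeats: u32) -> Duration {
        if repeats == 0 {
            return Duration::ZERO;
        }
        let factor = 2u32.saturating_pow(repeats - 1);
        self.base.saturating_mul(factor).min(self.max)
    }

    /// Returns true if `command` may be sent at `now`, and records the send.
    /// A command that differs from the previous one is treated as a real
    /// state change and is always allowed immediately.
    pub fn try_send(&mut self, command: &str, now: Instant) -> bool {
        if self.last_command.as_deref() != Some(command) {
            self.last_command = Some(command.to_owned());
            self.repeats = 0;
            self.not_before = None;
        }
        if let Some(deadline) = self.not_before {
            if now < deadline {
                // Same command again, still inside the backoff window.
                return false;
            }
        }
        self.repeats = self.repeats.saturating_add(1);
        self.not_before = Some(now + self.delay_for(self.repeats));
        true
    }

    /// Clears all state, e.g. after an event shows the command succeeded.
    pub fn reset(&mut self) {
        self.last_command = None;
        self.repeats = 0;
        self.not_before = None;
    }
}
```

With a base of one second, the first send of a command goes out immediately, an identical retry is held back for one second, the next for two, then four, and so on up to the cap. In the private registry scenario above, that would turn a tight retry loop into a slowly widening one, while a genuinely different command still goes out right away. Swapping in a Fibonacci sequence would only change `delay_for`.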