CircuitBreaker stuck in half open #903
Did you update to v1.3.1?
No, was it a known issue in v1.2.0?
In your configuration only 10 calls are permitted in the half-open state.
So if the calls are stuck, would the CircuitBreaker wait forever for those calls? If not, then it should allow more calls in the next window and calculate the error rate. If I upgrade to 1.3.1, would it help?
No, an upgrade to v1.3.1 does not help with that issue, but you should still always upgrade to the latest version, because it contains other bug fixes. The half-open state has no rolling time window. Hystrix permits only 1 test call; Resilience4j allows multiple calls, but you should make sure that they never get stuck. You could reduce permitted-number-of-calls-in-half-open-state.
I reduced "permitted-number-of-calls-in-half-open-state" to 5, then it got stuck at 4. Earlier we were using 0.13.2 and we didn't face this issue.
It seems one of your calls gets stuck. Can you trace it?
We have set a timeout for our backend. If the call is stuck we consider it a request timeout.
Would the CircuitBreaker wait forever for stuck calls to complete, because it doesn't have a rolling time window in the half_open state?
Currently yes.
Can we skip that window somehow?
What do you mean by skip window? Which window?
The current sliding window in the half_open state which got stuck, and move to the next window where it will allow more calls to go to the backend and recalculate the error rate.
No, the HALF-OPEN state has a fixed-size COUNT_BASED sliding window which cannot be skipped.
Can we somehow detect that the window is stuck and release the permission?
I think first you should investigate whether a call really gets stuck.
Could there be another reason, because we didn't face this issue in v0.13.2?
Yes, this behavior was introduced in v0.15.0 and was requested by many users.
Could you check if a TimeoutException is thrown somewhere?
Yes, we throw TimeoutExceptions in case we time out.
I'm asking if you can see it in your logs during your load test ;)
Yes, I am seeing "io.micronaut.http.client.exceptions.ReadTimeoutException".
Is the above your full configuration, or do you ignore certain exceptions?
No, these are all the configurations that we set.
Could you please attach event consumers to your CircuitBreaker and check the log output?
It might help us to understand how many events are published after the state transition from OPEN to HALF_OPEN.
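For reference, a minimal sketch of attaching event consumers with the Resilience4j event publisher API; the registry lookup, the instance name and the logger are illustrative:

import io.github.resilience4j.circuitbreaker.CircuitBreaker;
import org.slf4j.Logger;
import org.slf4j.LoggerFactory;

Logger log = LoggerFactory.getLogger("circuitbreaker-events");

// Look up the instance and log every event, so it is visible how many calls
// are actually executed after the OPEN -> HALF_OPEN transition.
CircuitBreaker circuitBreaker = circuitBreakerRegistry.circuitBreaker("store-price");
circuitBreaker.getEventPublisher()
        .onStateTransition(event -> log.info("{}", event))
        .onSuccess(event -> log.info("{}", event))
        .onError(event -> log.info("{}", event))
        .onCallNotPermitted(event -> log.info("{}", event));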
Do you guys have any plans to implement rolling time window functionality which would resolve this issue?
There are no plans yet.
@RobWin We are working on the logs in stage. Looking deeper into this: when we call tryAcquirePermission, we have a Subscriber that calls circuitBreaker.onSuccess in every onNext, and in onError it calls either circuitBreaker.onSuccess or circuitBreaker.onError depending on the status code. I assume these will release the permissions. What you have been describing is that there may be a path where we are not calling these, yes?
Did you implement your own decorator for Reactor or RxJava?
I believe we are using reactive streams. I'll talk to my principal about this tomorrow.
If you really need your own reactive streams operator, please look at our implementation. Otherwise I can't help you without knowing your implementation.
We are running inside Micronaut. We add an HttpClientFilter that intercepts every backend call. It calls tryAcquirePermission; if it returns true we make the call, if not we return 423 without making the call. What seems to be happening is that the CB transitions to the half-open state at some point, but tryAcquirePermission never returns true. If tryAcquirePermission allows the call, we call onError or onSuccess on the CB, so we should be releasing the permission. I'm trying to find a way to write a functional test that shows this, but it's been difficult.
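For context, the pattern described here only works if every successfully acquired permission is followed by exactly one onSuccess, onError or releasePermission. A minimal blocking sketch of that pairing, assuming the 1.x signatures that take a TimeUnit; callBackend and lockedResponse are placeholders:

import java.util.concurrent.TimeUnit;
import io.github.resilience4j.circuitbreaker.CircuitBreaker;

long start = System.nanoTime();
if (!circuitBreaker.tryAcquirePermission()) {
    return lockedResponse();             // placeholder: return 423 without calling the backend
}
try {
    Object response = callBackend();     // placeholder for the actual backend call
    circuitBreaker.onSuccess(System.nanoTime() - start, TimeUnit.NANOSECONDS);
    return response;
} catch (Exception e) {
    circuitBreaker.onError(System.nanoTime() - start, TimeUnit.NANOSECONDS, e);
    throw e;
}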
Can you show me the code of your Filter?
We are trying to use your publisher.
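For anyone following along, a minimal sketch of decorating a reactive call with the operator from the resilience4j-reactor module (there is an equivalent operator for RxJava 2); the Mono and the backend call are illustrative:

import io.github.resilience4j.reactor.circuitbreaker.operator.CircuitBreakerOperator;
import reactor.core.publisher.Mono;

// The operator acquires the permission on subscription and releases it when the
// stream terminates or is cancelled, so permissions should not leak.
Mono<Object> decorated = Mono.fromCallable(() -> callBackend())   // callBackend is a placeholder
        .transform(CircuitBreakerOperator.of(circuitBreaker));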
This seems to fix the problem. The metrics still get stuck for 45 seconds. I'm guessing this is the back-off period. Is that configurable?
Hi, could you please use:
Resilience4j version: 1.2.0
Java version: 11.0.2
Micronaut version: 1.2.9 RELEASE
We are using the TIME_BASED sliding window type. When we run a load test, most of the time our pod gets stuck in the half_open state and does not allow any calls to the backend.
We observed that whenever the pod gets stuck in the half_open state, the metric
resilience4j_circuitbreaker_buffered_calls{kind="failed",name="store-price",ready="true",} 8.0
does not refresh with new values and keeps the same number, which I suspect prevents the error rate from being calculated in the half-open state. Our observation is that the buffered calls get stuck at a number less than the minimum number of calls.
This is the configuration we have for the TIME_BASED sliding window:
failure-rate-threshold: 25
wait-duration-in-open-state: 10s
sliding-window-type: TIME_BASED
sliding-window-size: 1
minimum-number-of-calls: 10
wait-duration-in-open-state: 1000ms
permitted-number-of-calls-in-half-open-state: 10
It's the same behavior when we switch to the COUNT_BASED configuration. Below is the configuration we used.
"failure-rate-threshold": "100",
"sliding-window-type": "COUNT_BASED",
"sliding-window-size": "100",
"minimum-number-of-calls": "100",
"wait-duration-in-open-state": "10s",
"permitted-number-of-calls-in-half-open-state": "10"
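For completeness, a sketch of the same COUNT_BASED settings expressed with the programmatic CircuitBreakerConfig builder, using the 1.x builder method names:

import java.time.Duration;
import io.github.resilience4j.circuitbreaker.CircuitBreakerConfig;
import io.github.resilience4j.circuitbreaker.CircuitBreakerConfig.SlidingWindowType;

// Programmatic equivalent of the COUNT_BASED configuration listed above.
CircuitBreakerConfig config = CircuitBreakerConfig.custom()
        .failureRateThreshold(100)
        .slidingWindowType(SlidingWindowType.COUNT_BASED)
        .slidingWindowSize(100)
        .minimumNumberOfCalls(100)
        .waitDurationInOpenState(Duration.ofSeconds(10))
        .permittedNumberOfCallsInHalfOpenState(10)
        .build();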
Appreciate any help!!!