CircuitBreaker stuck in half open #903

Closed
f2014440 opened this issue Mar 5, 2020 · 35 comments

f2014440 commented Mar 5, 2020

Resilience4j version:
1.2.0
Java version:
11.0.2
Micronaut version:
1.2.9 RELEASE

We are using the TIME_BASED sliding window type. When we run a load test, our pod often gets stuck in the HALF_OPEN state and does not allow any calls to the backend.
We observed that whenever the pod gets stuck in the HALF_OPEN state, the metric
resilience4j_circuitbreaker_buffered_calls{kind="failed",name="store-price",ready="true",} 8.0
stops refreshing with new values and keeps the same number, which I suspect means the failure rate is never calculated in the HALF_OPEN state. The buffered call count gets stuck at a number lower than the configured minimum number of calls.
This is the configuration we have for the TIME_BASED sliding window:
failure-rate-threshold: 25
wait-duration-in-open-state: 10s
sliding-window-type: TIME_BASED
sliding-window-size: 1
minimum-number-of-calls: 10
wait-duration-in-open-state: 1000ms
permitted-number-of-calls-in-half-open-state: 10

We see the same behavior when we switch to a COUNT_BASED configuration. Below is the configuration we used.
failure-rate-threshold: 100
sliding-window-type: COUNT_BASED
sliding-window-size: 100
minimum-number-of-calls: 100
wait-duration-in-open-state: 10s
permitted-number-of-calls-in-half-open-state: 10
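
For reference, a rough Java-API equivalent of these COUNT_BASED settings (we actually configure everything through Micronaut properties, so this is only a sketch; the registry wiring and the "store-price" name are illustrative):

CircuitBreakerConfig config = CircuitBreakerConfig.custom()
    .failureRateThreshold(100)
    .slidingWindowType(CircuitBreakerConfig.SlidingWindowType.COUNT_BASED)
    .slidingWindowSize(100)
    .minimumNumberOfCalls(100)
    .waitDurationInOpenState(Duration.ofSeconds(10))
    .permittedNumberOfCallsInHalfOpenState(10)
    .build();
CircuitBreaker circuitBreaker = CircuitBreakerRegistry.of(config).circuitBreaker("store-price");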

Appreciate any help !!!

RobWin commented Mar 5, 2020

Did you update to v1.3.1?

f2014440 commented Mar 5, 2020

No, was it a known issue in v1.2.0?

RobWin commented Mar 5, 2020

In your configuration only 10 calls are permitted in the HALF_OPEN state.
That means the CircuitBreaker keeps an atomic integer counter, permits only those 10 calls, and then waits for their results. If calls are cancelled, you have to release the permission. Could it be that some of your remote calls are stuck or cancelled?
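
To make the permission lifecycle concrete, here is a minimal sketch (callBackend() is just a placeholder for your remote call):

if (circuitBreaker.tryAcquirePermission()) {
    long start = System.nanoTime();
    try {
        callBackend();
        // Recording a result frees the HALF_OPEN slot again.
        circuitBreaker.onSuccess(System.nanoTime() - start, TimeUnit.NANOSECONDS);
    } catch (Exception e) {
        circuitBreaker.onError(System.nanoTime() - start, TimeUnit.NANOSECONDS, e);
    }
} else {
    // No permission: the CircuitBreaker is OPEN or all HALF_OPEN test slots are in use.
}

// An asynchronous call that is cancelled before it produces any result never reaches
// onSuccess or onError, so its permission has to be handed back explicitly:
// circuitBreaker.releasePermission();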

f2014440 commented Mar 5, 2020

So if the calls are stuck, would the CircuitBreaker wait forever for those calls? If not, it should allow more calls in the next window and calculate the error rate.

If I upgrade to 1.3.1, would it help?

RobWin commented Mar 5, 2020

No, an upgrade to v1.3.1 does not help with this issue, but you should still upgrade to the latest version, because it contains other bug fixes.

The HALF_OPEN state has no rolling time window. Hystrix permits only one test call; Resilience4j allows multiple calls, but you should make sure that they never get stuck. You could reduce permitted-number-of-calls-in-half-open-state to 1 or 3 and see if your problem still occurs.

f2014440 commented Mar 5, 2020

I reduced permitted-number-of-calls-in-half-open-state to 5 and then it got stuck at 4. Earlier we were using v0.13.2 and we didn't face this issue.

RobWin commented Mar 5, 2020

It seems one of your calls gets stuck. Can you trace it?

f2014440 commented Mar 5, 2020

We have set a timeout for our backend. If a call is stuck, we treat it as a request timeout.

f2014440 commented Mar 5, 2020

Would the CircuitBreaker wait forever for stuck calls to complete, since it doesn't have a rolling time window in the HALF_OPEN state?

RobWin commented Mar 5, 2020

Currently yes.

f2014440 commented Mar 5, 2020

Can we skip that window somehow?

RobWin commented Mar 5, 2020

What do you mean by skip window? Which window?

f2014440 commented Mar 5, 2020

The current sliding window in the HALF_OPEN state, which got stuck. I'd like it to move on to the next window, where it would allow more calls to the backend and recalculate the error rate.

RobWin commented Mar 5, 2020

No, the HALF_OPEN state has a fixed-size COUNT_BASED sliding window which cannot be skipped.
Currently you have to make sure that no calls get stuck. If they do get stuck, you have to cancel them and release the permission.

f2014440 commented Mar 5, 2020

Can we somehow detect that the window is stuck and release the permission?

RobWin commented Mar 5, 2020

I think you should first investigate whether a call really gets stuck.
You could register an event consumer, consume the state transition event from OPEN to HALF_OPEN, and transition back to OPEN if the CircuitBreaker doesn't do so automatically after a certain period.
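
A rough sketch of such a watchdog (the ScheduledExecutorService and the 30 second grace period are only examples, not something the library provides):

circuitBreaker.getEventPublisher().onStateTransition(event -> {
    if (event.getStateTransition() == CircuitBreaker.StateTransition.OPEN_TO_HALF_OPEN) {
        scheduler.schedule(() -> {
            // If the test calls are still pending after the grace period,
            // force the CircuitBreaker back to OPEN instead of waiting forever.
            if (circuitBreaker.getState() == CircuitBreaker.State.HALF_OPEN) {
                circuitBreaker.transitionToOpenState();
            }
        }, 30, TimeUnit.SECONDS);
    }
});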

f2014440 commented Mar 5, 2020

Could there be another reason, given that we didn't face this issue in v0.13.2?
And to make sure that our calls don't get stuck, we have request timeouts for our backends.

RobWin commented Mar 5, 2020

Yes, this behavior was introduced in v0.15.0 and was requested by many users.

RobWin commented Mar 5, 2020

Could you check if a TimeoutException is thrown somewhere?

f2014440 commented Mar 5, 2020

Yes, we throw a TimeoutException when a call times out.

RobWin commented Mar 5, 2020

I'm asking if you can see it in your logs during your load test ;)

f2014440 commented Mar 5, 2020

Yes, I am seeing "io.micronaut.http.client.exceptions.ReadTimeoutException".

RobWin commented Mar 5, 2020

Is the above your full configuration, or do you ignore certain exceptions?

f2014440 commented Mar 5, 2020

No, these are all the configurations that we set.

RobWin commented Mar 5, 2020

Could you please attach event consumers to your CircuitBreaker and check the log output?

circuitBreaker.getEventPublisher()
    .onEvent(event -> logger.info(event.toString()));

It might help us to understand how many events are published after the state transition from OPEN to HALF_OPEN.

f2014440 commented Mar 9, 2020

Do you have any plans to implement a rolling time window for the HALF_OPEN state, which would resolve this issue?

RobWin commented Mar 9, 2020

There are no plans yet.
Do you have the log output?

@echozdog

@RobWin We are working on the logs in our staging environment. Looking deeper into this: when we call tryAcquirePermission, we have a Subscriber that calls circuitBreaker.onSuccess on every onNext, and in onError it calls either circuitBreaker.onSuccess or circuitBreaker.onError depending on the status code. I assume these release the permissions. Is what you have been describing a possible way we are not calling these?

RobWin commented Mar 10, 2020

Did you implement your own decorator for Reactor or RxJava?

@echozdog

I believe we are using Reactive Streams. I'll talk to my principal about this tomorrow.

RobWin commented Mar 11, 2020

If you really need your own reactive streams operator, please look at our implementation. Otherwise I can't help you without knowing your implementation.

@echozdog

We are running inside Micronaut. We add an HttpClientFilter that intercepts every backend call. It calls tryAcquirePermission; if it returns true we allow the call, and if not we return 423 without making the call. What seems to be happening is that the CircuitBreaker transitions to the HALF_OPEN state at some point, but tryAcquirePermission never returns true. If tryAcquirePermission allows the call, we call onError or onSuccess on the CircuitBreaker, so we should be releasing the permission. I'm trying to find a way to write a functional test that shows this, but it has been difficult.
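
Roughly, the idea looks like this (a simplified sketch, not our production filter; the 423 mapping, the status-code branching we do in onError, and the cancellation handling shown here are illustrative):

  @Override
  public Publisher<? extends HttpResponse<?>> doFilter(MutableHttpRequest<?> request, ClientFilterChain chain) {
    String backendName = request.getAttribute(HttpAttributes.SERVICE_ID, String.class).orElse("unknown");
    CircuitBreaker circuitBreaker = circuitBreakerRegistry.circuitBreaker(backendName);

    if (!circuitBreaker.tryAcquirePermission()) {
      // CircuitBreaker is OPEN or all HALF_OPEN permits are taken: short-circuit with 423.
      return Flowable.error(new HttpStatusException(HttpStatus.LOCKED, "Circuit open for " + backendName));
    }

    long start = System.nanoTime();
    AtomicBoolean resultRecorded = new AtomicBoolean(false);
    return Flowable.fromPublisher(chain.proceed(request))
        .doOnNext(response -> {
          if (resultRecorded.compareAndSet(false, true)) {
            circuitBreaker.onSuccess(System.nanoTime() - start, TimeUnit.NANOSECONDS);
          }
        })
        .doOnError(error -> {
          if (resultRecorded.compareAndSet(false, true)) {
            circuitBreaker.onError(System.nanoTime() - start, TimeUnit.NANOSECONDS, error);
          }
        })
        // A request that is cancelled before any result is recorded never hits onSuccess
        // or onError, so the permission has to be given back explicitly or it leaks.
        .doOnCancel(() -> {
          if (resultRecorded.compareAndSet(false, true)) {
            circuitBreaker.releasePermission();
          }
        });
  }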

RobWin commented Mar 12, 2020

Can you show me the code of your Filter?

echozdog commented Mar 13, 2020

We are now trying it with your CircuitBreakerOperator:

  @Override
  public Publisher<? extends HttpResponse<?>> doFilter(MutableHttpRequest<?> request, ClientFilterChain chain) {
    String backendName = request.getAttribute(HttpAttributes.SERVICE_ID, String.class).orElseGet(() -> "unknown");
    CircuitBreaker circuitBreaker = circuitBreakerRegistry.circuitBreaker(backendName);
    CircuitBreakerOperator<HttpResponse<?>> circuitBreakerOperator = CircuitBreakerOperator.of(circuitBreaker);
    return circuitBreakerOperator.apply(Flowable.fromPublisher(chain.proceed(request)));
  }

This seems to fix the problem. The metrics still get stuck for 45 seconds; I'm guessing this is the back-off period. Is that configurable?

RobWin commented Mar 23, 2020

Hi,
you don't have to create a CircuitBreakerOperator for every call. You can move the creation into the constructor of your filter.

Could you please use:

return Flowable.fromPublisher(chain.proceed(request)).compose(circuitBreakerOperator);
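
Sketched out (the class and constructor wiring here are illustrative; adapt the lookup if you resolve a different CircuitBreaker per backend):

  private final CircuitBreakerOperator<HttpResponse<?>> circuitBreakerOperator;

  public BackendCircuitBreakerFilter(CircuitBreakerRegistry circuitBreakerRegistry) {
    // Create the operator once instead of on every request ("store-price" as an example name).
    this.circuitBreakerOperator = CircuitBreakerOperator.of(circuitBreakerRegistry.circuitBreaker("store-price"));
  }

  @Override
  public Publisher<? extends HttpResponse<?>> doFilter(MutableHttpRequest<?> request, ClientFilterChain chain) {
    return Flowable.fromPublisher(chain.proceed(request)).compose(circuitBreakerOperator);
  }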

RobWin closed this as completed Jun 3, 2020