CircuitBreaker stuck in half open #903

Closed
f2014440 opened this issue Mar 5, 2020 · 35 comments

f2014440 commented Mar 5, 2020

Resilience4j version:
1.2.0
Java version:
11.0.2
Micronaut version:
1.2.9 RELEASE

We are using the TIME_BASED sliding window type. When we run a load test, our pod often gets stuck in the HALF_OPEN state and does not allow any calls to the backend.
We observed that whenever the pod gets stuck in the HALF_OPEN state, the metric
resilience4j_circuitbreaker_buffered_calls{kind="failed",name="store-price",ready="true",} 8.0
stops refreshing with new values and keeps the same number, which I suspect means the failure rate is never calculated in the HALF_OPEN state. The buffered call count gets stuck at a number lower than the configured minimum number of calls.
This is the configuration we have for the TIME_BASED sliding window:
failure-rate-threshold: 25
wait-duration-in-open-state: 10s
sliding-window-type: TIME_BASED
sliding-window-size: 1
minimum-number-of-calls: 10
wait-duration-in-open-state: 1000ms
permitted-number-of-calls-in-half-open-state: 10

We see the same behavior when we switch to a COUNT_BASED configuration. Below is the configuration we used.
failure-rate-threshold: 100
sliding-window-type: COUNT_BASED
sliding-window-size: 100
minimum-number-of-calls: 100
wait-duration-in-open-state: 10s
permitted-number-of-calls-in-half-open-state: 10
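
For reference, a rough Java-API equivalent of these COUNT_BASED settings (we actually configure everything through Micronaut properties, so this is only a sketch; the registry wiring and the "store-price" name are illustrative):

CircuitBreakerConfig config = CircuitBreakerConfig.custom()
    .failureRateThreshold(100)
    .slidingWindowType(CircuitBreakerConfig.SlidingWindowType.COUNT_BASED)
    .slidingWindowSize(100)
    .minimumNumberOfCalls(100)
    .waitDurationInOpenState(Duration.ofSeconds(10))
    .permittedNumberOfCallsInHalfOpenState(10)
    .build();
CircuitBreaker circuitBreaker = CircuitBreakerRegistry.of(config).circuitBreaker("store-price");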

Appreciate any help !!!

RobWin commented Mar 5, 2020

Did you update to v1.3.1?

f2014440 commented Mar 5, 2020

No, was it a known issue in v1.2.0?

RobWin commented Mar 5, 2020

In your configuration only 10 calls are permitted in the HALF_OPEN state.
That means the CircuitBreaker keeps an atomic integer counter, permits only those 10 calls, and then waits for their results. If calls are cancelled, you have to release the permission. Could it be that some of your remote calls are stuck or cancelled?
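
To make the permission lifecycle concrete, here is a minimal sketch (callBackend() is just a placeholder for your remote call):

if (circuitBreaker.tryAcquirePermission()) {
    long start = System.nanoTime();
    try {
        callBackend();
        // Recording a result frees the HALF_OPEN slot again.
        circuitBreaker.onSuccess(System.nanoTime() - start, TimeUnit.NANOSECONDS);
    } catch (Exception e) {
        circuitBreaker.onError(System.nanoTime() - start, TimeUnit.NANOSECONDS, e);
    }
} else {
    // No permission: the CircuitBreaker is OPEN or all HALF_OPEN test slots are in use.
}

// An asynchronous call that is cancelled before it produces any result never reaches
// onSuccess or onError, so its permission has to be handed back explicitly:
// circuitBreaker.releasePermission();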

f2014440 commented Mar 5, 2020

So if the calls are stuck, would the CircuitBreaker wait forever for those calls? If not, it should allow more calls in the next window and calculate the error rate.

If I upgrade to 1.3.1, would it help?

RobWin commented Mar 5, 2020

No, an upgrade to v1.3.1 does not help with this issue, but you should still upgrade to the latest version, because it contains other bug fixes.

The HALF_OPEN state has no rolling time window. Hystrix permits only one test call; Resilience4j allows multiple calls, but you should make sure that they never get stuck. You could reduce permitted-number-of-calls-in-half-open-state to 1 or 3 and see if your problem still occurs.

f2014440 commented Mar 5, 2020

I reduced permitted-number-of-calls-in-half-open-state to 5 and then it got stuck at 4. Earlier we were using v0.13.2 and we didn't face this issue.

RobWin commented Mar 5, 2020

It seems one of your calls gets stuck. Can you trace it?

f2014440 commented Mar 5, 2020

We have set a timeout for our backend. If a call is stuck, we treat it as a request timeout.

f2014440 commented Mar 5, 2020

Would the CircuitBreaker wait forever for stuck calls to complete, since it doesn't have a rolling time window in the HALF_OPEN state?

RobWin commented Mar 5, 2020

Currently yes.

f2014440 commented Mar 5, 2020

Can we skip that window somehow?

RobWin commented Mar 5, 2020

What do you mean by skip window? Which window?

f2014440 commented Mar 5, 2020

The current sliding window in the HALF_OPEN state, which got stuck. I'd like it to move on to the next window, where it would allow more calls to the backend and recalculate the error rate.

RobWin commented Mar 5, 2020

No, the HALF_OPEN state has a fixed-size COUNT_BASED sliding window which cannot be skipped.
Currently you have to make sure that no calls get stuck. If they do get stuck, you have to cancel them and release the permission.

f2014440 commented Mar 5, 2020

Can we somehow detect that the window is stuck and release the permission?

RobWin commented Mar 5, 2020

I think you should first investigate whether a call really gets stuck.
You could register an event consumer, consume the state transition event from OPEN to HALF_OPEN, and transition back to OPEN if the CircuitBreaker doesn't do so automatically after a certain period.
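
A rough sketch of such a watchdog (the ScheduledExecutorService and the 30 second grace period are only examples, not something the library provides):

circuitBreaker.getEventPublisher().onStateTransition(event -> {
    if (event.getStateTransition() == CircuitBreaker.StateTransition.OPEN_TO_HALF_OPEN) {
        scheduler.schedule(() -> {
            // If the test calls are still pending after the grace period,
            // force the CircuitBreaker back to OPEN instead of waiting forever.
            if (circuitBreaker.getState() == CircuitBreaker.State.HALF_OPEN) {
                circuitBreaker.transitionToOpenState();
            }
        }, 30, TimeUnit.SECONDS);
    }
});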

f2014440 commented Mar 5, 2020

Could there be another reason, given that we didn't face this issue in v0.13.2?
And to make sure that our calls don't get stuck, we have request timeouts for our backends.

RobWin commented Mar 5, 2020

Yes, this behavior was introduced in v0.15.0 and was requested by many users.

RobWin commented Mar 5, 2020

Could you check if a TimeoutException is thrown somewhere?

f2014440 commented Mar 5, 2020

Yes, we throw a TimeoutException when a call times out.

RobWin commented Mar 5, 2020

I'm asking if you can see it in your logs during your load test ;)

f2014440 commented Mar 5, 2020

Yes, I am seeing "io.micronaut.http.client.exceptions.ReadTimeoutException".

RobWin commented Mar 5, 2020

Is the above your full configuration, or do you ignore certain exceptions?

f2014440 commented Mar 5, 2020

No, these are all the configurations that we set.

RobWin commented Mar 5, 2020

Could you please attach event consumers to your CircuitBreaker and check the log output?

circuitBreaker.getEventPublisher()
    .onEvent(event -> logger.info(event.toString()));

It might help us to understand how many events are published after the state transition from OPEN to HALF_OPEN.

f2014440 commented Mar 9, 2020

Do you have any plans to implement a rolling time window for the HALF_OPEN state, which would resolve this issue?

RobWin commented Mar 9, 2020

There are no plans yet.
Do you have the log output?

@echozdog

@RobWin We are working on the logs in our staging environment. Looking deeper into this: when we call tryAcquirePermission, we have a Subscriber that calls circuitBreaker.onSuccess on every onNext, and in onError it calls either circuitBreaker.onSuccess or circuitBreaker.onError depending on the status code. I assume these release the permissions. Is what you have been describing a possible way we are not calling these?

RobWin commented Mar 10, 2020

Did you implement your own decorator for Reactor or RxJava?

@echozdog

I believe we are using Reactive Streams. I'll talk to my principal about this tomorrow.

RobWin commented Mar 11, 2020

If you really need your own reactive streams operator, please look at our implementation. Otherwise I can't help you without knowing your implementation.

@echozdog

We are running inside Micronaut. We add an HttpClientFilter that intercepts every backend call. It calls tryAcquirePermission; if it returns true we allow the call, and if not we return 423 without making the call. What seems to be happening is that the CircuitBreaker transitions to the HALF_OPEN state at some point, but tryAcquirePermission never returns true. If tryAcquirePermission allows the call, we call onError or onSuccess on the CircuitBreaker, so we should be releasing the permission. I'm trying to find a way to write a functional test that shows this, but it has been difficult.
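
Roughly, the idea looks like this (a simplified sketch, not our production filter; the 423 mapping, the status-code branching we do in onError, and the cancellation handling shown here are illustrative):

  @Override
  public Publisher<? extends HttpResponse<?>> doFilter(MutableHttpRequest<?> request, ClientFilterChain chain) {
    String backendName = request.getAttribute(HttpAttributes.SERVICE_ID, String.class).orElse("unknown");
    CircuitBreaker circuitBreaker = circuitBreakerRegistry.circuitBreaker(backendName);

    if (!circuitBreaker.tryAcquirePermission()) {
      // CircuitBreaker is OPEN or all HALF_OPEN permits are taken: short-circuit with 423.
      return Flowable.error(new HttpStatusException(HttpStatus.LOCKED, "Circuit open for " + backendName));
    }

    long start = System.nanoTime();
    AtomicBoolean resultRecorded = new AtomicBoolean(false);
    return Flowable.fromPublisher(chain.proceed(request))
        .doOnNext(response -> {
          if (resultRecorded.compareAndSet(false, true)) {
            circuitBreaker.onSuccess(System.nanoTime() - start, TimeUnit.NANOSECONDS);
          }
        })
        .doOnError(error -> {
          if (resultRecorded.compareAndSet(false, true)) {
            circuitBreaker.onError(System.nanoTime() - start, TimeUnit.NANOSECONDS, error);
          }
        })
        // A request that is cancelled before any result is recorded never hits onSuccess
        // or onError, so the permission has to be given back explicitly or it leaks.
        .doOnCancel(() -> {
          if (resultRecorded.compareAndSet(false, true)) {
            circuitBreaker.releasePermission();
          }
        });
  }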

RobWin commented Mar 12, 2020

Can you show me the code of your Filter?

echozdog commented Mar 13, 2020

We are now trying it with your CircuitBreakerOperator:

  @Override
  public Publisher<? extends HttpResponse<?>> doFilter(MutableHttpRequest<?> request, ClientFilterChain chain) {
    String backendName = request.getAttribute(HttpAttributes.SERVICE_ID, String.class).orElseGet(() -> "unknown");
    CircuitBreaker circuitBreaker = circuitBreakerRegistry.circuitBreaker(backendName);
    CircuitBreakerOperator<HttpResponse<?>> circuitBreakerOperator = CircuitBreakerOperator.of(circuitBreaker);
    return circuitBreakerOperator.apply(Flowable.fromPublisher(chain.proceed(request)));
  }

This seems to fix the problem. The metrics still get stuck for 45 seconds; I'm guessing this is the back-off period. Is that configurable?

RobWin commented Mar 23, 2020

Hi,
you don't have to create a CircuitBreakerOperator for every call. You can move the creation into the constructor of your filter.

Could you please use:

return Flowable.fromPublisher(chain.proceed(request)).compose(circuitBreakerOperator);
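
Sketched out (the class and constructor wiring here are illustrative; adapt the lookup if you resolve a different CircuitBreaker per backend):

  private final CircuitBreakerOperator<HttpResponse<?>> circuitBreakerOperator;

  public BackendCircuitBreakerFilter(CircuitBreakerRegistry circuitBreakerRegistry) {
    // Create the operator once instead of on every request ("store-price" as an example name).
    this.circuitBreakerOperator = CircuitBreakerOperator.of(circuitBreakerRegistry.circuitBreaker("store-price"));
  }

  @Override
  public Publisher<? extends HttpResponse<?>> doFilter(MutableHttpRequest<?> request, ClientFilterChain chain) {
    return Flowable.fromPublisher(chain.proceed(request)).compose(circuitBreakerOperator);
  }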

RobWin closed this as completed Jun 3, 2020