Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

java.lang.StackOverflowError due to infinite loop in CircuitBreakerStateMachine$OpenState.tryAcquirePermission #2038

Open
polkosity opened this issue Oct 24, 2023 · 15 comments
Labels
Milestone

Comments

@polkosity
Copy link

polkosity commented Oct 24, 2023

Resilience4j version: 2.1.0

Java version: any

I've encountered java.lang.StackOverflowError due to infinite loop in CircuitBreakerStateMachine$OpenState.tryAcquirePermission in a production server, which was under strain.

After reviewing the code, I suggest it's caused by failure in the AtomicReference getAndUpdate call (CircuitBreakerStateMachine.java:333), as such failures can occurr due to contention among threads, according to Javadoc. As an aside, Javadoc also states that the updateFunction should be side-effect free, which is not the case here.

Once the updateFunction fails during an attempted transition from open to half open state, the AtomicReference CircuitBreakerStateMachine.stateReference still refers to an OpenState whose AtomicBoolean OpenState.isOpen has already been set to false (CircuitBreakerStateMachine.java:812), so the transitionToHalfOpenState() call will not be attempted again, additionally retryAfterWaitDuration is already in the past for this OpenState object. The effect is that OpenState.tryAcquirePermission calls itself recursively (CircuitBreakerStateMachine.java:737) until the StackOverflowError occurs.

Please see the attached screenshot displaying a partial stack trace of the StackOverflowError, which was generated under resilience4j-circuitbreaker v1.7.1, but the relevant code sections have not changed since then.

StackOverflowError
@RobWin
Copy link
Member

RobWin commented Oct 28, 2023

Hi,
thank you for the analysis.
First time that I see this. Do you have a proposal how to fix it?

@RobWin RobWin added the bug label Oct 28, 2023
@storozhukBM
Copy link
Member

@polkosity
https://docs.oracle.com/en/java/javase/21/docs/api/java.base/java/util/concurrent/atomic/AtomicReference.html#getAndUpdate(java.util.function.UnaryOperator) can't fail, it will be retried until update is successful.

I think the problem is here

where compareAndSet can return false to one thread before the other thread has finished stateReference.getAndUpdate execution. I think fix should remove isOpen boolean entirely and rely only on stateReference.

@RobWin what do you think?

@storozhukBM
Copy link
Member

One more thing: if we decide to remove isOpen entirely, it can lead to high contention in stateTransition method, so to remedy this situation, we can rewrite it to use the CAS loop with checks if stateReference is not in the target state already before retries.

@Hartigan
Copy link
Contributor

Hartigan commented Dec 1, 2023

Hi
I think I found root cause of problem. I created test which reproduce problem. I found what error occurs when circuitBreakerConfig.isAutomaticTransitionFromOpenToHalfOpenEnabled() enabled. If transition from OPEN to HALF_OPEN state occurs from this future we face this scenario:

  1. isOpen change to false
  2. Start transition with preTransitionHook
  3. Future cancel itself with mayInterruptIfRunning == true

Therefore, thread with future interrupted after isOpen changed to false and before stateReference change.
Next call of tryAcquirePermission creates infinite recursion and stack overflow.

@fatso83
Copy link

fatso83 commented Dec 5, 2023

Also being affected by this in a couple of systems. Not sure how we can help in supplying data.

@Hartigan
Copy link
Contributor

Hartigan commented Dec 5, 2023

I created pull request with fix. Also I removed unnecessary synchronized.

@Hartigan
Copy link
Contributor

Hartigan commented Dec 5, 2023

I found what we cannot remove synchronized because of this code. Pull request already updated

@RobWin
Copy link
Member

RobWin commented Dec 13, 2023

@fatso83 do you really need to enable automaticTransitionFromOpenToHalfOpen? I assume we didn't find this issue earlier, because most people don't use it. You could disable it temporary until it is fixed.

@RobWin
Copy link
Member

RobWin commented Dec 13, 2023

@Hartigan Thanks for you awesome analysis.

@fatso83
Copy link

fatso83 commented Dec 14, 2023

@RobWin Thanks for the hint! This worked for us. We are quite a big org where we had a wrapper lib around resilience4j where this was enabled by default. Disabling it worked.

@fatso83
Copy link

fatso83 commented Dec 14, 2023

Should not this be closed by #2072, now that it is merged?

do you really need to enable automaticTransitionFromOpenToHalfOpen? I assume we didn't find this issue earlier, because most people don't use it

@RobWin We were discussing this at work and we don't really understand how/why you believe the library is being used with the automatic state handling disabled. I am guessing this is down to some misunderstanding (on our part, most probably 😄). I would think automatic handling is something most people would want? Otherwise one has to manually implement some kind of logic to close the circuit. Something seems off. Right now it seems like we need to listen for state transitions and create a Timer that asks to close the circuit x seconds into the future. This seems a bit simplistic. What are we missing here?

This is just us trying to learn more/understand more, since we think we are missing out on some obvious fact.

@RobWin
Copy link
Member

RobWin commented Dec 15, 2023

@fatso83
The flag automaticTransitionFromOpenToHalfOpen is really an optional feature which was added at a later point in time.

The CircuitBreaker always automatically switches from open to half-open when a new request is processed. But it could mean that the state is shown as OPEN for a longer period of time until a new request arrives. Which confused some people who where looking at the CircuitBreaker state in systems which didn't have any traffic.
Thus we introduces a thread which automatically switches to HALF_OPEN after a period of time, even if no traffic is on the system. But we needed to have atomic guarantees since the thread and a new request could run concurrently.

But from the functional point of view there is no differences, except that it introduces problems right now :(

@RobWin RobWin added this to the 2.2.0 milestone Dec 17, 2023
@RobWin RobWin closed this as completed Dec 17, 2023
@fatso83
Copy link

fatso83 commented Dec 18, 2023

Aha, thank you. That also explains why it was an issue by us, as we trigger alerts when the circuit breakers are open. So to have the circuit breaker alerts not be warning us all night we would need this. Makes sense, thanks!

@RobWin
Copy link
Member

RobWin commented Dec 18, 2023

Well, If you have a production system which has not traffic at night, which would change the state automatically from open to half-open, why have warnings at night at all ;P

@RobWin
Copy link
Member

RobWin commented Jan 17, 2024

Seems the issue is not yet fixed.
Some user has reported the same issue in v2.2.0
#2095

@RobWin RobWin reopened this Jan 17, 2024
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
Projects
None yet
Development

No branches or pull requests

5 participants