Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Added a circuit breaker layer #5134

Merged
merged 1 commit into from
Jul 5, 2024
Merged

Added a circuit breaker layer #5134

merged 1 commit into from
Jul 5, 2024

Conversation

fulmicoton
Copy link
Contributor

@fulmicoton fulmicoton commented Jun 17, 2024

Adds a circuit breaker layer.

The piece that estimates whether the next request is likely to fail is extremely simplistic for the moment.
It simply counter the number of errors (not taking in account successes) that happened in a given time window.

The reason is that for the moment, we want to use it for persist requests when the WAL is full.
On airmail, the aggressive retry logic of the client was causing a massive grpc storm on the faulty indexer node,
taking all of its CPU and preventing it from getting out of that state.

In this case, the error estimation logic is very simple, a full WAL guarantees that no further persist request will be successful for a little while.

Tested

  • on unit tests
  • on simian, by voluntarily setting a WAL size to 500mb.
  • on airmail.

Copy link

github-actions bot commented Jun 17, 2024

On SSD:

Average search latency is 1.01x that of the reference (lower is better).
Ref run id: 2185, ref commit: 7547ac3
Link

On GCS:

Average search latency is 1.06x that of the reference (lower is better).
Ref run id: 2186, ref commit: 7547ac3
Link

@fulmicoton fulmicoton force-pushed the circuit_breaker branch 10 times, most recently from 18d9bda to b3197e0 Compare June 18, 2024 07:44
@fulmicoton fulmicoton marked this pull request as ready for review June 18, 2024 08:38
@fulmicoton fulmicoton requested a review from guilload June 19, 2024 03:31
The piece that estimates whether the next request is likely to fail is extremely simplistic for the moment.
It simply counter the number of errors (not taking in account successes) that happened in a given time window.

The reason is that for the moment, we want to use it for persist requests when the WAL is full.
On airmail, the aggressive retry logic of the client was causing a massive grpc storm on the faulty indexer node,
taking all of its CPU and preventing it from getting out of that state.

In this case, the error estimation logic is very simple, a full WAL guarantees that no further persist request will be successful for a little while.
@fulmicoton fulmicoton merged commit 1fba1d1 into main Jul 5, 2024
5 checks passed
@fulmicoton fulmicoton deleted the circuit_breaker branch July 5, 2024 01:17
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

1 participant