cache: implement the circuit breaker pattern for asynchronous set operations in the cache client #7010

Merged
merged 6 commits into from Feb 25, 2024

Conversation

Contributor

@damnever damnever commented Dec 26, 2023

Ref #6975

  • I added CHANGELOG entry for this change.
  • Change is not relevant to the end user.

Changes

Wrap SetAsync with a circuit breaker to dynamically limit cache backfilling requests. The same circuit breaker configuration is added to both the memcached and Redis clients.

Verification

Added some unit tests.

Contributor

@yeya24 yeya24 left a comment

Overall it is great work and it makes a lot of sense. Thanks.
We need to update docs.

pkg/cacheutil/memcached_client.go
pkg/cacheutil/memcached_client_test.go
SetAsyncCircuitBreakerOpenDuration: 5 * time.Second,
SetAsyncCircuitBreakerMinRequests: 50,
SetAsyncCircuitBreakerConsecutiveFailures: 5,
SetAsyncCircuitBreakerFailurePercent: 0.05,
Contributor

We might need to keep an eye on this to make sure those values are sane.
So far, it looks good to me. It would be better to have another pair of eyes on it.

Member

@GiedriusS GiedriusS left a comment

Please update docs with new fields + add a paragraph explaining how this works 😄

… in the cache client

Signed-off-by: Xiaochao Dong (@damnever) <the.xcdong@gmail.com>
@damnever damnever force-pushed the cache/circuitbreaker branch 3 times, most recently from 44ca87c to c69653b Compare January 25, 2024 10:54
Signed-off-by: Xiaochao Dong (@damnever) <the.xcdong@gmail.com>
@damnever damnever force-pushed the cache/circuitbreaker branch 3 times, most recently from 2ad5763 to 7ae1164 Compare January 25, 2024 11:27
Signed-off-by: Xiaochao Dong (@damnever) <the.xcdong@gmail.com>
yeya24 previously approved these changes Jan 26, 2024
Contributor

@yeya24 yeya24 left a comment

Thanks!

@yeya24
Contributor

yeya24 commented Jan 28, 2024

17:08:51 redis-redis: 1:C 26 Jan 2024 17:08:51.802 # Configuration loaded
17:08:51 redis-redis: 1:M 26 Jan 2024 17:08:51.802 * monotonic clock: POSIX clock_gettime
17:08:51 redis-redis: 1:M 26 Jan 2024 17:08:51.803 * Running mode=standalone, port=6379.
17:08:51 redis-redis: 1:M 26 Jan 2024 17:08:51.803 # Server initialized
17:08:51 redis-redis: 1:M 26 Jan 2024 17:08:51.803 # WARNING overcommit_memory is set to 0! Background save may fail under low memory condition. To fix this issue add 'vm.overcommit_memory = 1' to /etc/sysctl.conf and then reboot or run the command 'sysctl vm.overcommit_memory=1' for this to take effect.
17:08:51 redis-redis: 1:M 26 Jan 2024 17:08:51.803 * Ready to accept connections
17:08:52 static: 172.30.0.8 - - [26/Jan/2024:17:08:52 +0000] "GET /metrics HTTP/1.1" 200 172 "-" "Prometheus/2.41.0" "-"
17:08:52 Ports for container redis-client-redis-redis >> Local ports: map[redis:6379] Ports available from host: map[redis:32919]
    store_gateway_test.go:1001: store_gateway_test.go:1001: ""
        
         unexpected error: set async circuit breaker: consecutive failures must be greater than 0
        
17:08:52 Killing redis-redis
=== CONT  TestRangeQueryShardingWithRandomData
--- FAIL: TestRedisClient_Rueidis (2.90s)

We need to fix the E2E test. @damnever https://github.com/thanos-io/thanos/actions/runs/7665969811/job/20908734057?pr=7010
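
The failure above comes from validating breaker settings even though the test never enabled the breaker; a later commit in this PR is titled "Skip configuration validation if the circuit breaker is disabled". A minimal sketch of that kind of guard follows. The `breakerConfig` type, its field names, and the failure-percent check with its message are assumptions for illustration; only the consecutive-failures message is taken from the log.

```go
package main

import (
	"errors"
	"fmt"
)

// breakerConfig is a hypothetical subset of the options discussed in this PR.
type breakerConfig struct {
	Enabled             bool
	MinRequests         uint32
	ConsecutiveFailures uint32
	FailurePercent      float64
}

// validate checks the breaker settings, but only when the breaker is enabled,
// so zero-valued configs from callers that never opted in are accepted.
func (c breakerConfig) validate() error {
	if !c.Enabled {
		return nil
	}
	if c.ConsecutiveFailures == 0 {
		// Message text as seen in the E2E log above.
		return errors.New("set async circuit breaker: consecutive failures must be greater than 0")
	}
	if c.FailurePercent <= 0 || c.FailurePercent > 1 {
		// Illustrative check; the exact rule in the PR is not shown on this page.
		return fmt.Errorf("set async circuit breaker: failure percent must be in (0, 1], got %v", c.FailurePercent)
	}
	return nil
}
```

With this guard, a zero-valued config such as the one the Redis E2E test presumably passed is accepted, while an enabled breaker with `ConsecutiveFailures` left at 0 is still rejected.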

Signed-off-by: Xiaochao Dong (@damnever) <the.xcdong@gmail.com>
@yeya24
Contributor

yeya24 commented Feb 1, 2024

@damnever There is one more lint error to fix before we can merge. Thanks.

Signed-off-by: Xiaochao Dong (@damnever) <the.xcdong@gmail.com>
yeya24 previously approved these changes Feb 1, 2024
@yeya24
Contributor

yeya24 commented Feb 1, 2024

@GiedriusS Wanna take another look before I merge this code change?

MaxRequests: config.SetAsyncCircuitBreakerHalfOpenMaxRequests,
Interval: 10 * time.Second,
Timeout: config.SetAsyncCircuitBreakerOpenDuration,
ReadyToTrip: func(counts gobreaker.Counts) bool {
Member

Should this be factored out into another function? I think it's exactly the same in memcached_client.

Contributor Author

I have separated the entire configuration and logic, including this function. Please take another look.
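
For readers following the refactoring discussion, here is a sketch of what a shared trip condition could look like, based on the behaviour described in this PR's documentation change (a minimum request count, then either consecutive failures or a failure ratio). The local `counts` type mirrors only the fields of `gobreaker.Counts` used here so the example has no dependencies; `readyToTrip` is a hypothetical helper, and whether the real thresholds use `>` or `>=` is an assumption.

```go
package main

// counts mirrors the fields of gobreaker.Counts that this sketch needs.
// It is defined locally so the example compiles without third-party packages.
type counts struct {
	Requests            uint32
	TotalFailures       uint32
	ConsecutiveFailures uint32
}

// readyToTrip builds a callback in the shape of gobreaker's
// Settings.ReadyToTrip, so both cache clients could share one implementation.
func readyToTrip(minRequests, consecutiveFailures uint32, failurePercent float64) func(counts) bool {
	return func(c counts) bool {
		// Never trip on too small a sample (and avoid dividing by zero).
		if c.Requests == 0 || c.Requests < minRequests {
			return false
		}
		// Trip on a run of consecutive failures.
		if c.ConsecutiveFailures >= consecutiveFailures {
			return true
		}
		// Otherwise trip when the overall failure ratio is too high.
		return float64(c.TotalFailures)/float64(c.Requests) >= failurePercent
	}
}
```

With the values from the test snippet earlier in this review (50 minimum requests, 5 consecutive failures, 0.05 failure ratio), ten failures out of ten requests would not trip the breaker, because the sample is still below the minimum.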

@@ -340,6 +346,12 @@ While the remaining settings are **optional**:
- `max_async_concurrency`: maximum number of concurrent asynchronous operations that can occur.
- `max_async_buffer_size`: maximum number of enqueued asynchronous operations allowed.
- `max_get_multi_concurrency`: maximum number of concurrent connections when fetching keys. If set to `0`, the concurrency is unlimited.
- `set_async_circuit_breaker_enabled`: `true` to enable the circuit breaker for asynchronous operations. The circuit breaker has three states: closed, half-open, and open. It begins in the closed state. When the total number of requests exceeds `set_async_circuit_breaker_min_requests`, and either consecutive failures occur or the failure percentage is excessively high according to the configured values, the circuit breaker transitions to the open state. In the open state, all asynchronous operations are rejected. After `set_async_circuit_breaker_open_duration`, the circuit breaker transitions to the half-open state, where it allows `set_async_circuit_breaker_half_open_max_requests` asynchronous operations to be processed in order to test whether conditions have improved. If they have not, the state transitions back to open; if they have, it transitions to closed. After each 10-second interval in the closed state, the circuit breaker resets its metrics and repeats this cycle.
- `set_async_circuit_breaker_half_open_max_requests`: maximum number of requests allowed to pass through when the circuit breaker is half-open. If set to 0, the circuit breaker allows only 1 request.
- `set_async_circuit_breaker_open_duration`: the period of the open state after which the state of the circuit breaker becomes half-open. If set to 0, the circuit breaker resets it to 60 seconds.
Member

If set to 0, the circuit breaker resets it to 60 seconds.

What does it mean? That the default value of this is 60 seconds? 🤔

Contributor Author

Yes, github.com/sony/gobreaker uses this default value.

Signed-off-by: Xiaochao Dong (@damnever) <the.xcdong@gmail.com>
@yeya24
Contributor

yeya24 commented Feb 20, 2024

E2E test failure

=== NAME  TestRuleNativeHistograms
    native_histograms_test.go:391: native_histograms_test.go:391: ""
        
         unexpected error: unable to find metrics [prometheus_remote_storage_histograms_total] with expected values after 50 retries. Last error: <nil>. Last values: [0]

This doesn't seem related. Ignoring for now.

@yeya24
Contributor

yeya24 commented Feb 25, 2024

Hey @GiedriusS, I will merge this PR now. If you find anything else that's worth improving, feel free to create a separate issue and we can address it later.

@yeya24 yeya24 merged commit f72b767 into thanos-io:main Feb 25, 2024
19 of 20 checks passed
@GiedriusS
Member

👍 I am also wondering whether it makes sense to protect only SET operations with the circuit breaker pattern. If the cache is down, GETs won't work either.

@yeya24
Contributor

yeya24 commented Feb 26, 2024

I think what you said makes sense. @GiedriusS
Need to test this one out first and we can improve GETs later.
WDYT @damnever ?

@damnever
Contributor Author

I agree; I think GET is more sensitive to latency and should be assigned a higher priority.

Member

@GiedriusS GiedriusS left a comment

So maybe let's remove the set_ prefix from config and put gets under the circuit breaker too? Should be a small change. Ideally this would be done before the next release

@damnever
Contributor Author

Removing the set_ prefix sounds fine, but I recommend using a different circuit breaker for the GET operation. The intent of this PR is to limit the resources used by async set operations (#6975), which in turn benefits other operations such as GETs.

jnyi pushed a commit to jnyi/thanos that referenced this pull request Apr 4, 2024
…rations in the cache client (thanos-io#7010)

* Implement the circuit breaker pattern for asynchronous set operations in the cache client

Signed-off-by: Xiaochao Dong (@damnever) <the.xcdong@gmail.com>

* Add feature flag for circuitbreaker

Signed-off-by: Xiaochao Dong (@damnever) <the.xcdong@gmail.com>

* Sync docs

Signed-off-by: Xiaochao Dong (@damnever) <the.xcdong@gmail.com>

* Skip configuration validation if the circuit breaker is disabled

Signed-off-by: Xiaochao Dong (@damnever) <the.xcdong@gmail.com>

* Make lint happy

Signed-off-by: Xiaochao Dong (@damnever) <the.xcdong@gmail.com>

* Abstract the logic of the circuit breaker

Signed-off-by: Xiaochao Dong (@damnever) <the.xcdong@gmail.com>

---------

Signed-off-by: Xiaochao Dong (@damnever) <the.xcdong@gmail.com>

3 participants