cache: implement the circuit breaker pattern for asynchronous set operations in the cache client (#7010)

* Implement the circuit breaker pattern for asynchronous set operations in the cache client

Signed-off-by: Xiaochao Dong (@damnever) <the.xcdong@gmail.com>

* Add feature flag for circuitbreaker

Signed-off-by: Xiaochao Dong (@damnever) <the.xcdong@gmail.com>

* Sync docs

Signed-off-by: Xiaochao Dong (@damnever) <the.xcdong@gmail.com>

* Skip configuration validation if the circuit breaker is disabled

Signed-off-by: Xiaochao Dong (@damnever) <the.xcdong@gmail.com>

* Make lint happy

Signed-off-by: Xiaochao Dong (@damnever) <the.xcdong@gmail.com>

* Abstract the logic of the circuit breaker

Signed-off-by: Xiaochao Dong (@damnever) <the.xcdong@gmail.com>

---------

Signed-off-by: Xiaochao Dong (@damnever) <the.xcdong@gmail.com>
damnever committed Feb 25, 2024
1 parent 2f1f83f commit f72b767
Showing 8 changed files with 348 additions and 22 deletions.
1 change: 1 addition & 0 deletions CHANGELOG.md
@@ -62,6 +62,7 @@ We use *breaking :warning:* to mark changes that are not backward compatible (re
- [#6887](https://github.com/thanos-io/thanos/pull/6887) Query Frontend: *breaking :warning:* Add tenant label to relevant exported metrics. Note that this change may cause some pre-existing custom dashboard queries to be incorrect due to the added label.
- [#7028](https://github.com/thanos-io/thanos/pull/7028) Query|Query Frontend: Add new `--query-frontend.enable-x-functions` flag to enable experimental extended functions.
- [#6884](https://github.com/thanos-io/thanos/pull/6884) Tools: Add upload-block command to upload blocks to object storage.
- [#7010](https://github.com/thanos-io/thanos/pull/7010) Cache: Added `set_async_circuit_breaker_*` to utilize the circuit breaker pattern for dynamically thresholding asynchronous set operations.

### Changed

14 changes: 14 additions & 0 deletions docs/components/query-frontend.md
@@ -77,6 +77,13 @@ config:
max_get_multi_batch_size: 0
dns_provider_update_interval: 0s
auto_discovery: false
set_async_circuit_breaker_config:
enabled: false
half_open_max_requests: 0
open_duration: 0s
min_requests: 0
consecutive_failures: 0
failure_percent: 0
expiration: 0s
```

@@ -132,6 +139,13 @@ config:
master_name: ""
max_async_buffer_size: 10000
max_async_concurrency: 20
set_async_circuit_breaker_config:
enabled: false
half_open_max_requests: 10
open_duration: 5s
min_requests: 50
consecutive_failures: 5
failure_percent: 0.05
expiration: 24h0m0s
```

21 changes: 21 additions & 0 deletions docs/components/store.md
@@ -334,6 +334,13 @@ config:
max_get_multi_batch_size: 0
dns_provider_update_interval: 0s
auto_discovery: false
set_async_circuit_breaker_config:
enabled: false
half_open_max_requests: 0
open_duration: 0s
min_requests: 0
consecutive_failures: 0
failure_percent: 0
enabled_items: []
ttl: 0s
```
@@ -353,6 +360,13 @@ While the remaining settings are **optional**:
- `max_item_size`: maximum size of an item to be stored in memcached. This option should be set to the same value of memcached `-I` flag (defaults to 1MB) in order to avoid wasting network round trips to store items larger than the max item size allowed in memcached. If set to `0`, the item size is unlimited.
- `dns_provider_update_interval`: the DNS discovery update interval.
- `auto_discovery`: whether to use the auto-discovery mechanism for memcached.
- `set_async_circuit_breaker_config`: the configuration for the circuit breaker for asynchronous set operations.
  - `enabled`: `true` to enable the circuit breaker for asynchronous set operations. The circuit breaker has three states: closed, half-open, and open. It starts in the closed state. Once the total number of requests reaches `min_requests`, and either `consecutive_failures` consecutive failures have occurred or the failure ratio has reached `failure_percent`, the circuit breaker transitions to the open state, in which all asynchronous set operations are rejected. After `open_duration`, it transitions to the half-open state, where it lets up to `half_open_max_requests` asynchronous set operations through to test whether conditions have improved. If they have not, the state returns to open; if they have, it transitions to closed. While in the closed state, the circuit breaker resets its counters every 10 seconds and repeats this cycle.
  - `half_open_max_requests`: maximum number of requests allowed to pass through when the circuit breaker is half-open. If set to 0, the circuit breaker allows only 1 request.
  - `open_duration`: how long the circuit breaker stays open before it becomes half-open. If set to 0, the circuit breaker uses the default value of 60 seconds.
  - `min_requests`: minimum number of requests that must be observed before the circuit breaker can open. 0 means no minimum.
  - `consecutive_failures`: number of consecutive failures that opens the circuit breaker once `min_requests` has been reached. Must be greater than 0 when the circuit breaker is enabled.
  - `failure_percent`: failure ratio that opens the circuit breaker once `min_requests` has been reached. Despite the name, it is a fraction in the range (0, 1], so `0.05` means 5%.
- `enabled_items`: selectively choose what types of items to cache. Supported values are `Postings`, `Series` and `ExpandedPostings`. By default, all items are cached.
- `ttl`: ttl to store index cache items in memcached.
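
The state cycle described for `enabled` can be sketched as a small state machine. This is a simplified illustration and not the code from this commit, which delegates to `github.com/sony/gobreaker`. The names `toyBreaker`, `tripAfter`, `errRejected`, and `runDemo` are hypothetical; the sketch trips on consecutive failures only, omits `min_requests`, `failure_percent`, and the 10-second counter reset, and is not safe for concurrent use.

```go
package main

import (
	"errors"
	"fmt"
	"time"
)

type breakerState int

const (
	stateClosed breakerState = iota
	stateOpen
	stateHalfOpen
)

func (s breakerState) String() string {
	return [...]string{"closed", "open", "half-open"}[s]
}

// errRejected is a hypothetical sentinel returned while requests are blocked.
var errRejected = errors.New("toy breaker: request rejected")

// toyBreaker is a deliberately simplified, single-goroutine circuit breaker.
type toyBreaker struct {
	state        breakerState
	openedAt     time.Time
	openDuration time.Duration // plays the role of open_duration
	halfOpenMax  int           // plays the role of half_open_max_requests
	tripAfter    int           // consecutive failures that open the breaker
	failures     int           // consecutive failures seen so far
	probes       int           // requests admitted while half-open
	successes    int           // successful probes while half-open
	now          func() time.Time
}

func (b *toyBreaker) Execute(f func() error) error {
	// open -> half-open once openDuration has elapsed.
	if b.state == stateOpen && b.now().Sub(b.openedAt) >= b.openDuration {
		b.state, b.probes, b.successes = stateHalfOpen, 0, 0
	}
	switch b.state {
	case stateOpen:
		return errRejected
	case stateHalfOpen:
		if b.probes >= b.halfOpenMax {
			return errRejected
		}
		b.probes++
	}
	if err := f(); err != nil {
		b.failures++
		// Any failed probe, or too many consecutive failures, opens the breaker.
		if b.state == stateHalfOpen || b.failures >= b.tripAfter {
			b.state, b.openedAt, b.failures = stateOpen, b.now(), 0
		}
		return err
	}
	b.failures = 0
	if b.state == stateHalfOpen {
		b.successes++
		if b.successes >= b.halfOpenMax {
			b.state = stateClosed // enough successful probes: half-open -> closed
		}
	}
	return nil
}

// runDemo drives the breaker with a fake clock and records what it observes.
func runDemo() []string {
	clock := time.Unix(0, 0)
	b := &toyBreaker{
		openDuration: 5 * time.Second,
		halfOpenMax:  2,
		tripAfter:    3,
		now:          func() time.Time { return clock },
	}
	fail := func() error { return errors.New("backend down") }
	ok := func() error { return nil }

	var trace []string
	for i := 0; i < 3; i++ {
		_ = b.Execute(fail)
	}
	trace = append(trace, b.state.String()) // three consecutive failures
	if errors.Is(b.Execute(ok), errRejected) {
		trace = append(trace, "rejected") // still inside openDuration
	}
	clock = clock.Add(5 * time.Second)
	_ = b.Execute(ok)
	trace = append(trace, b.state.String()) // first successful probe
	_ = b.Execute(ok)
	trace = append(trace, b.state.String()) // second successful probe
	return trace
}

func main() {
	for _, step := range runDemo() {
		fmt.Println(step)
	}
}
```

The fake clock is injected through `now` so that the open-to-half-open transition can be exercised without sleeping.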

@@ -385,6 +399,13 @@ config:
master_name: ""
max_async_buffer_size: 10000
max_async_concurrency: 20
set_async_circuit_breaker_config:
enabled: false
half_open_max_requests: 10
open_duration: 5s
min_requests: 50
consecutive_failures: 5
failure_percent: 0.05
enabled_items: []
ttl: 0s
```
87 changes: 87 additions & 0 deletions pkg/cacheutil/cacheutil.go
@@ -5,12 +5,29 @@ package cacheutil

import (
"context"
"time"

"github.com/pkg/errors"
"github.com/sony/gobreaker"
"golang.org/x/sync/errgroup"

"github.com/thanos-io/thanos/pkg/gate"
)

var (
errCircuitBreakerConsecutiveFailuresNotPositive = errors.New("circuit breaker: consecutive failures must be greater than 0")
errCircuitBreakerFailurePercentInvalid = errors.New("circuit breaker: failure percent must be in range (0,1]")

defaultCircuitBreakerConfig = CircuitBreakerConfig{
Enabled: false,
HalfOpenMaxRequests: 10,
OpenDuration: 5 * time.Second,
MinRequests: 50,
ConsecutiveFailures: 5,
FailurePercent: 0.05,
}
)

// doWithBatch runs f over batches, limited by the gate. batchSize==0 means one batch. gate==nil means no gate.
func doWithBatch(ctx context.Context, totalSize int, batchSize int, ga gate.Gate, f func(startIndex, endIndex int) error) error {
if totalSize == 0 {
@@ -40,3 +57,73 @@ func doWithBatch(ctx context.Context, totalSize int, batchSize int, ga gate.Gate
}
return g.Wait()
}

// CircuitBreaker implements the circuit breaker pattern https://en.wikipedia.org/wiki/Circuit_breaker_design_pattern.
type CircuitBreaker interface {
Execute(func() error) error
}

// CircuitBreakerConfig is the config for the circuit breaker.
type CircuitBreakerConfig struct {
// Enabled enables the circuit breaker.
Enabled bool `yaml:"enabled"`

// HalfOpenMaxRequests is the maximum number of requests allowed to pass through
// when the circuit breaker is half-open.
// If set to 0, the circuit breaker allows only 1 request.
HalfOpenMaxRequests uint32 `yaml:"half_open_max_requests"`
// OpenDuration is the period of the open state after which the state of the circuit breaker becomes half-open.
// If set to 0, the circuit breaker utilizes the default value of 60 seconds.
OpenDuration time.Duration `yaml:"open_duration"`
// MinRequests is the minimum number of requests that must be observed before the circuit breaker can open.
MinRequests uint32 `yaml:"min_requests"`
// ConsecutiveFailures is the number of consecutive failures that opens the circuit breaker once MinRequests has been reached.
ConsecutiveFailures uint32 `yaml:"consecutive_failures"`
// FailurePercent is the failure ratio, in the range (0,1], that opens the circuit breaker once MinRequests has been reached.
FailurePercent float64 `yaml:"failure_percent"`
}

func (c CircuitBreakerConfig) validate() error {
if !c.Enabled {
return nil
}
if c.ConsecutiveFailures == 0 {
return errCircuitBreakerConsecutiveFailuresNotPositive
}
if c.FailurePercent <= 0 || c.FailurePercent > 1 {
return errCircuitBreakerFailurePercentInvalid
}
return nil
}

type noopCircuitBreaker struct{}

func (noopCircuitBreaker) Execute(f func() error) error { return f() }

type gobreakerCircuitBreaker struct {
*gobreaker.CircuitBreaker
}

func (cb gobreakerCircuitBreaker) Execute(f func() error) error {
_, err := cb.CircuitBreaker.Execute(func() (any, error) {
return nil, f()
})
return err
}

func newCircuitBreaker(name string, config CircuitBreakerConfig) CircuitBreaker {
if !config.Enabled {
return noopCircuitBreaker{}
}
return gobreakerCircuitBreaker{gobreaker.NewCircuitBreaker(gobreaker.Settings{
Name: name,
MaxRequests: config.HalfOpenMaxRequests,
Interval: 10 * time.Second,
Timeout: config.OpenDuration,
ReadyToTrip: func(counts gobreaker.Counts) bool {
return counts.Requests >= config.MinRequests &&
(counts.ConsecutiveFailures >= uint32(config.ConsecutiveFailures) ||
float64(counts.TotalFailures)/float64(counts.Requests) >= config.FailurePercent)
},
})}
}
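
To make the `ReadyToTrip` condition above concrete, the following sketch evaluates the same predicate against a few hand-picked counter values, using the thresholds from `defaultCircuitBreakerConfig`. The types `counts` and `tripConfig` and the function `readyToTrip` are hypothetical local stand-ins, so the example runs without the gobreaker dependency; only the boolean expression is taken from the diff.

```go
package main

import "fmt"

// counts is a local stand-in for the three gobreaker.Counts fields
// that the ReadyToTrip callback in the diff reads.
type counts struct {
	Requests            uint32
	TotalFailures       uint32
	ConsecutiveFailures uint32
}

// tripConfig is a hypothetical subset of CircuitBreakerConfig.
type tripConfig struct {
	MinRequests         uint32
	ConsecutiveFailures uint32
	FailurePercent      float64
}

// readyToTrip mirrors the boolean expression from newCircuitBreaker.
func readyToTrip(c counts, cfg tripConfig) bool {
	return c.Requests >= cfg.MinRequests &&
		(c.ConsecutiveFailures >= cfg.ConsecutiveFailures ||
			float64(c.TotalFailures)/float64(c.Requests) >= cfg.FailurePercent)
}

func main() {
	// Thresholds copied from defaultCircuitBreakerConfig.
	cfg := tripConfig{MinRequests: 50, ConsecutiveFailures: 5, FailurePercent: 0.05}

	cases := []struct {
		name string
		c    counts
	}{
		{"49 requests, all failed", counts{49, 49, 49}},
		{"50 requests, 2 scattered failures", counts{50, 2, 1}},
		{"50 requests, 3 scattered failures", counts{50, 3, 1}},
		{"200 requests, 5 failures in a row", counts{200, 5, 5}},
	}
	for _, tc := range cases {
		fmt.Printf("%-36s trips=%v\n", tc.name, readyToTrip(tc.c, cfg))
	}
}
```

The first case does not trip even though every request failed, because `MinRequests` gates both conditions. The third case trips on the ratio alone (3/50 = 0.06), and the fourth trips on consecutive failures alone (5/200 = 0.025 is below the ratio threshold).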
51 changes: 36 additions & 15 deletions pkg/cacheutil/memcached_client.go
@@ -17,6 +17,7 @@ import (
"github.com/pkg/errors"
"github.com/prometheus/client_golang/prometheus"
"github.com/prometheus/client_golang/prometheus/promauto"
"github.com/sony/gobreaker"
"gopkg.in/yaml.v2"

"github.com/thanos-io/thanos/pkg/discovery/dns"
@@ -54,6 +55,8 @@ var (
MaxGetMultiBatchSize: 0,
DNSProviderUpdateInterval: 10 * time.Second,
AutoDiscovery: false,

SetAsyncCircuitBreaker: defaultCircuitBreakerConfig,
}
)

@@ -141,6 +144,9 @@ type MemcachedClientConfig struct {

// AutoDiscovery configures memcached client to perform auto-discovery instead of DNS resolution
AutoDiscovery bool `yaml:"auto_discovery"`

// SetAsyncCircuitBreaker configures the circuit breaker for SetAsync operations.
SetAsyncCircuitBreaker CircuitBreakerConfig `yaml:"set_async_circuit_breaker_config"`
}

func (c *MemcachedClientConfig) validate() error {
@@ -158,6 +164,9 @@ func (c *MemcachedClientConfig) validate() error {
return errMemcachedMaxAsyncConcurrencyNotPositive
}

if err := c.SetAsyncCircuitBreaker.validate(); err != nil {
return err
}
return nil
}

@@ -195,6 +204,8 @@ type memcachedClient struct {
dataSize *prometheus.HistogramVec

p *AsyncOperationProcessor

setAsyncCircuitBreaker CircuitBreaker
}

// AddressProvider performs node address resolution given a list of clusters.
@@ -277,7 +288,8 @@ func newMemcachedClient(
config.MaxGetMultiConcurrency,
gate.Gets,
),
- p: NewAsyncOperationProcessor(config.MaxAsyncBufferSize, config.MaxAsyncConcurrency),
+ p: NewAsyncOperationProcessor(config.MaxAsyncBufferSize, config.MaxAsyncConcurrency),
+ setAsyncCircuitBreaker: newCircuitBreaker("memcached-set-async", config.SetAsyncCircuitBreaker),
}

c.clientInfo = promauto.With(reg).NewGaugeFunc(prometheus.GaugeOpts{
@@ -375,22 +387,31 @@ func (c *memcachedClient) SetAsync(key string, value []byte, ttl time.Duration)
start := time.Now()
c.operations.WithLabelValues(opSet).Inc()

- err := c.client.Set(&memcache.Item{
- Key: key,
- Value: value,
- Expiration: int32(time.Now().Add(ttl).Unix()),
+ err := c.setAsyncCircuitBreaker.Execute(func() error {
+ return c.client.Set(&memcache.Item{
+ Key: key,
+ Value: value,
+ Expiration: int32(time.Now().Add(ttl).Unix()),
+ })
  })
  if err != nil {
- // If the PickServer will fail for any reason the server address will be nil
- // and so missing in the logs. We're OK with that (it's a best effort).
- serverAddr, _ := c.selector.PickServer(key)
- level.Debug(c.logger).Log(
- "msg", "failed to store item to memcached",
- "key", key,
- "sizeBytes", len(value),
- "server", serverAddr,
- "err", err,
- )
+ if errors.Is(err, gobreaker.ErrOpenState) || errors.Is(err, gobreaker.ErrTooManyRequests) {
+ level.Warn(c.logger).Log(
+ "msg", "circuit breaker disallows storing item in memcached",
+ "key", key,
+ "err", err)
+ } else {
+ // If the PickServer will fail for any reason the server address will be nil
+ // and so missing in the logs. We're OK with that (it's a best effort).
+ serverAddr, _ := c.selector.PickServer(key)
+ level.Debug(c.logger).Log(
+ "msg", "failed to store item to memcached",
+ "key", key,
+ "sizeBytes", len(value),
+ "server", serverAddr,
+ "err", err,
+ )
+ }
c.trackError(opSet, err)
return
}
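
The branch added to `SetAsync` separates writes rejected by the circuit breaker from writes that reached memcached and failed. A minimal sketch of that classification follows, assuming two local sentinel errors (`errOpenState`, `errTooManyRequests`) as stand-ins for `gobreaker.ErrOpenState` and `gobreaker.ErrTooManyRequests`; `classify` and `demo` are hypothetical helpers, not part of the commit.

```go
package main

import (
	"errors"
	"fmt"
)

// Local stand-ins for gobreaker.ErrOpenState and gobreaker.ErrTooManyRequests,
// so the example runs without the gobreaker dependency.
var (
	errOpenState       = errors.New("circuit breaker is open")
	errTooManyRequests = errors.New("too many requests")
)

// classify reports how an asynchronous set attempt ended.
func classify(err error) string {
	switch {
	case err == nil:
		return "stored"
	case errors.Is(err, errOpenState), errors.Is(err, errTooManyRequests):
		// The breaker short-circuited the call; the backend was never contacted.
		return "rejected by breaker"
	default:
		return "backend failure"
	}
}

// demo classifies one example of each outcome, including a wrapped sentinel.
func demo() []string {
	return []string{
		classify(nil),
		classify(errOpenState),
		classify(fmt.Errorf("set %q: %w", "key-1", errTooManyRequests)),
		classify(errors.New("connection refused")),
	}
}

func main() {
	for _, outcome := range demo() {
		fmt.Println(outcome)
	}
}
```

Using `errors.Is` rather than `==` keeps the check working if the sentinel is wrapped on its way up. In the commit itself, both branches still fall through to `c.trackError(opSet, err)`, so rejected writes are counted as set errors as well as logged at warn level.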