resource_manager: add degraded mode #6063
Signed-off-by: Cabinfever_B <cabinfeveroier@gmail.com>
Signed-off-by: Cabinfever_B <cabinfeveroier@gmail.com> fix data race Signed-off-by: Cabinfever_B <cabinfeveroier@gmail.com>
925b286 to dbf29c3
Codecov Report
Coverage diff, master vs. #6063:
            master    #6063    +/-
Coverage    74.47%    74.48%
Files       393       393
Lines       38446     38519    +73
Hits        28631     28689    +58
Misses      7275      7290     +15
Partials    2540      2540
... and 26 files with indirect coverage changes
Signed-off-by: Cabinfever_B <cabinfeveroier@gmail.com>
We can have a switch to enable/disable degraded mode.
@@ -297,6 +320,10 @@ func (c *ResourceGroupsController) sendTokenBucketRequests(ctx context.Context,
Requests: requests,
TargetRequestPeriodMs: uint64(defaultTargetPeriod / time.Millisecond),
}
if c.run.responseDeadline == nil {
c.run.responseDeadline = time.NewTimer(time.Second)
I suggest using Stop() and Reset() here to reuse the timer as much as possible rather than creating a new one each time.
Signed-off-by: Cabinfever_B <cabinfeveroier@gmail.com>
Signed-off-by: Cabinfever_B <cabinfeveroier@gmail.com>
Signed-off-by: Cabinfever_B <cabinfeveroier@gmail.com>
Signed-off-by: Cabinfever_B <cabinfeveroier@gmail.com> add degraded mode switch Signed-off-by: Cabinfever_B <cabinfeveroier@gmail.com>
)
}

// GenerateConfig generates the configuration by the given request unit configuration.
func GenerateConfig(ruConfig *RequestUnitConfig) *Config {
func GenerateConfig(ruConfig *RequestUnitConfig, rmServerConfig *RMServerConfig) *Config {
Can we merge the ruConfig and rmServerConfig into one?
@@ -210,6 +247,7 @@ func (c *ResourceGroupsController) Stop() error {
return errors.Errorf("resource groups controller does not start")
}
c.loopCancel()
c.run.responseDeadline.Stop()
Is this necessary, given that there is already a defer earlier?
Signed-off-by: Cabinfever_B <cabinfeveroier@gmail.com>
Signed-off-by: Cabinfever_B <cabinfeveroier@gmail.com>
@@ -112,14 +117,17 @@ func NewResourceGroupController(
requestUnitConfig *RequestUnitConfig,
Do we still need this parameter?
I can't decide whether we need to use the specified config on the client side, so I have kept it.
@@ -31,7 +32,7 @@ import (
)

const (
requestUnitConfigPath = "resource_group/ru_config"
controllerConfigPath = "resource_group/control"
Maybe a little bit more specific would be better.
controllerConfigPath = "resource_group/control"
controllerConfigPath = "resource_group/controller_config"
Do we need the suffix? Other config key paths don't have "_config".
// RequestUnit is the configuration determines the coefficients of the RRU and WRU cost.
// This configuration should be modified carefully.
RequestUnit RequestUnitConfig
Where are the toml and json tags?
type ControllerConfig struct {
// EnableDegradedMode is to control whether resource control client enable degraded mode when server is disconnect.
EnableDegradedMode bool `toml:"enable-degraded-mode" json:"enable-degraded-mode"`

// RequestUnit is the configuration determines the coefficients of the RRU and WRU cost.
// This configuration should be modified carefully.
RequestUnit RequestUnitConfig
Ditto.
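To illustrate what the missing tags change, here is a minimal sketch using only `encoding/json` from the standard library. The lowercase type names, the `read-base-cost` field, and the `request-unit` tag value are assumptions made for the example; only the `enable-degraded-mode` tag appears in this PR. A tagged field is serialized under the tag name, while an untagged field falls back to its Go field name.

```go
package main

import (
	"encoding/json"
	"fmt"
)

// requestUnitConfig is a hypothetical stand-in for RequestUnitConfig.
// The single field below is illustrative only.
type requestUnitConfig struct {
	ReadBaseCost float64 `toml:"read-base-cost" json:"read-base-cost"`
}

// controllerConfig mirrors the shape discussed in the review.
type controllerConfig struct {
	EnableDegradedMode bool `toml:"enable-degraded-mode" json:"enable-degraded-mode"`
	// Tagged: serialized under the assumed key "request-unit".
	RequestUnit requestUnitConfig `toml:"request-unit" json:"request-unit"`
	// Untagged: encoding/json uses the Go field name "Untagged".
	Untagged int
}

// mustEncode marshals the config to JSON and panics on failure.
func mustEncode(c controllerConfig) string {
	b, err := json.Marshal(c)
	if err != nil {
		panic(err)
	}
	return string(b)
}

func main() {
	cfg := controllerConfig{
		EnableDegradedMode: true,
		RequestUnit:        requestUnitConfig{ReadBaseCost: 0.25},
	}
	fmt.Println(mustEncode(cfg))
}
```

The toml tags are present only as struct metadata here; a TOML library would be needed to exercise them.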
Signed-off-by: Cabinfever_B <cabinfeveroier@gmail.com>
@@ -161,6 +169,9 @@ func (c *ResourceGroupsController) Start(ctx context.Context) {
c.initRunState()
c.loopCtx, c.loopCancel = context.WithCancel(ctx)
go func() {
c.run.responseDeadline = time.NewTimer(time.Second)
c.run.responseDeadline.Stop()
remove this line?
Ref #6063 (comment): we create the Timer here but don't want it to send on its channel at the beginning, so we stop it right away.
@@ -344,6 +381,10 @@ func (c *ResourceGroupsController) sendTokenBucketRequests(ctx context.Context,
Requests: requests,
TargetRequestPeriodMs: uint64(defaultTargetPeriod / time.Millisecond),
}
if c.responseDeadlineCh == nil {
c.run.responseDeadline.Reset(time.Second)
How about making the deadline timeout configurable?
Good point
if c.responseDeadlineCh != nil {
if c.run.responseDeadline.Stop() {
select {
case <-c.run.responseDeadline.C:
Why is this line needed?
As the comment on Timer.Stop() says:
Stop does not close the channel, to prevent a read from the channel succeeding incorrectly. To ensure the channel is empty after a call to Stop, check the return value and drain the channel.
Signed-off-by: Cabinfever_B <cabinfeveroier@gmail.com>
rest lgtm
@@ -195,6 +212,8 @@ func (c *ResourceGroupsController) Start(ctx context.Context) {
c.updateAvgRequestResourcePerSec()
if !c.run.requestInProgress {
c.collectTokenBucketRequests(c.loopCtx, "low_ru", true /* only select low tokens resource group */)
} else if c.run.inDegradedMode {
I wonder whether taking this out of the else branch would avoid some unexpected behavior here:
if c.run.inDegradedMode
func (gc *groupCostController) applyBasicConfigForRawResourceTokenCounter() {
for typ, counter := range gc.run.resourceTokens {
if !counter.limiter.IsLowTokens() {
Maybe we should not skip it here, to make sure the group enters degraded mode, or else add a log here.
I want only low-token buckets to enter degraded mode. It will log when a bucket successfully enters degraded mode.
Signed-off-by: Cabinfever_B <cabinfeveroier@gmail.com>
if err != nil {
return nil, err
}
It looks duplicated
Signed-off-by: Cabinfever_B <cabinfeveroier@gmail.com>
Signed-off-by: Cabinfever_B <cabinfeveroier@gmail.com>
lgtm
@@ -61,8 +61,28 @@ const (
defaultWriteCostPerByte = 1. / 1024
// 1 RU = 3 millisecond CPU time
defaultCPUMsCost = 1. / 3

defaultDegradedModeWaitDuration = "1s"
We can turn this off by default first, then turn it on after more testing.
Signed-off-by: Cabinfever_B <cabinfeveroier@gmail.com> address comment Signed-off-by: Cabinfever_B <cabinfeveroier@gmail.com>
cb4c4c8 to e453550
Signed-off-by: Cabinfever_B <cabinfeveroier@gmail.com>
Signed-off-by: Cabinfever_B <cabinfeveroier@gmail.com>
ptal @JmPotato
Signed-off-by: Cabinfever_B <cabinfeveroier@gmail.com>
rest LGTM
pkg/storage/endpoint/key_path.go
Outdated
@@ -45,7 +45,7 @@ const (
// resource group storage endpoint has prefix `resource_group`
resourceGroupSettingsPath = "settings"
resourceGroupStatesPath = "states"
requestUnitConfigPath = "ru_config"
requestUnitConfigPath = "controller"
Do we also need to change this requestUnitConfigPath name?
@@ -61,7 +61,7 @@ func (se *StorageEndpoint) LoadResourceGroupStates(f func(k, v string)) error {
return se.loadRangeByPrefix(resourceGroupStatesPath+"/", f)
}

// SaveRequestUnitConfig stores the request unit config to storage.
func (se *StorageEndpoint) SaveRequestUnitConfig(config interface{}) error {
// SaveControllerConfig stores the request unit config to storage.
ditto for comment.
Signed-off-by: Cabinfever_B <cabinfeveroier@gmail.com>
Signed-off-by: Cabinfever_B <cabinfeveroier@gmail.com>
Signed-off-by: Cabinfever_B <cabinfeveroier@gmail.com>
lgtm
/merge
@JmPotato: It seems you want to merge this PR, I will help you trigger all the tests: /run-all-tests
This pull request has been accepted and is ready to merge. Commit hash: fa69657
What problem does this PR solve?
Issue Number: ref #5851
What is changed and how does it work?
Start a timer each time a token request is sent. If no response arrives within one second, the controller enters degraded mode. In degraded mode, a resource group that is running low on tokens receives the fill rate configured for it.
Check List
Tests
Code changes
Release note