resource_manager: add degraded mode #6063

Merged
merged 24 commits into from
Mar 16, 2023

Conversation

CabinfeverB
Member

@CabinfeverB CabinfeverB commented Feb 28, 2023

What problem does this PR solve?

Issue Number: ref #5851

What is changed and how does it work?

Start a timer each time a token request is sent. If no successful response arrives within one second, the controller enters degraded mode. In degraded mode, a resource group that is low on tokens is filled at its configured fill rate.
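The deadline described above can be sketched as a `select` between a response channel and a timer. This is a minimal illustration, not the code in this PR: `awaitTokens`, its channel type, and the returned `degraded` flag are hypothetical names chosen for the example.

```go
package main

import (
	"fmt"
	"time"
)

// awaitTokens waits for a token response on respCh. If nothing arrives
// before wait elapses, it reports degraded=true so the caller can fall back
// to locally configured limits. Name and signature are illustrative only.
func awaitTokens(respCh <-chan int, wait time.Duration) (tokens int, degraded bool) {
	deadline := time.NewTimer(wait)
	defer deadline.Stop()

	select {
	case tokens = <-respCh:
		return tokens, false
	case <-deadline.C:
		return 0, true
	}
}

func main() {
	// A response that is already available beats the deadline.
	ready := make(chan int, 1)
	ready <- 100
	tokens, degraded := awaitTokens(ready, 50*time.Millisecond)
	fmt.Println("tokens:", tokens, "degraded:", degraded)

	// No response at all: the deadline fires and degraded mode is reported.
	tokens, degraded = awaitTokens(make(chan int), 50*time.Millisecond)
	fmt.Println("tokens:", tokens, "degraded:", degraded)
}
```

The sketch allocates a timer per call for brevity; the review below discusses reusing a single timer instead.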

Check List

Tests

  • Unit test
  • Integration test

Code changes

  • Has the configuration change
  • Has persistent data change

Release note

None.

Signed-off-by: Cabinfever_B <cabinfeveroier@gmail.com>
@ti-chi-bot
Member

ti-chi-bot commented Feb 28, 2023

[REVIEW NOTIFICATION]

This pull request has been approved by:

  • JmPotato
  • nolouch

To complete the pull request process, please ask the reviewers in the list to review by filling /cc @reviewer in the comment.
After your PR has acquired the required number of LGTMs, you can assign this pull request to the committer in the list by filling /assign @committer in the comment to help you merge this pull request.

The full list of commands accepted by this bot can be found here.

Reviewer can indicate their review by submitting an approval review.
Reviewer can cancel approval by submitting a request changes review.

@CabinfeverB CabinfeverB requested review from nolouch and removed request for Yisaer February 28, 2023 12:19
Signed-off-by: Cabinfever_B <cabinfeveroier@gmail.com>

fix data race

Signed-off-by: Cabinfever_B <cabinfeveroier@gmail.com>
@codecov

codecov bot commented Feb 28, 2023

Codecov Report

Patch coverage: 71.79% and no project coverage change.

Comparison is base (8ba42df) 74.47% compared to head (fa69657) 74.48%.

❗ Current head fa69657 differs from pull request most recent head 7524367. Consider uploading reports for the commit 7524367 to get more accurate results

Additional details and impacted files
@@           Coverage Diff           @@
##           master    #6063   +/-   ##
=======================================
  Coverage   74.47%   74.48%           
=======================================
  Files         393      393           
  Lines       38446    38519   +73     
=======================================
+ Hits        28631    28689   +58     
- Misses       7275     7290   +15     
  Partials     2540     2540           
Flag Coverage Δ
unittests 74.48% <71.79%> (+<0.01%) ⬆️

Flags with carried forward coverage won't be shown. Click here to find out more.

Impacted Files Coverage Δ
pkg/storage/endpoint/key_path.go 93.33% <ø> (ø)
server/server.go 75.27% <0.00%> (ø)
client/resource_group/controller/controller.go 62.20% <68.35%> (+1.43%) ⬆️
pkg/mcs/resource_manager/server/config.go 69.73% <71.42%> (-0.27%) ⬇️
client/resource_group/controller/config.go 87.50% <80.00%> (-12.50%) ⬇️
client/resource_group/controller/limiter.go 69.93% <100.00%> (+8.68%) ⬆️
pkg/mcs/resource_manager/server/grpc_service.go 67.77% <100.00%> (+0.73%) ⬆️
pkg/mcs/resource_manager/server/manager.go 82.20% <100.00%> (ø)
pkg/mcs/resource_manager/server/server.go 59.22% <100.00%> (ø)
pkg/storage/endpoint/resource_group.go 85.71% <100.00%> (ø)
... and 1 more

... and 26 files with indirect coverage changes

Signed-off-by: Cabinfever_B <cabinfeveroier@gmail.com>
@kevin-xianliu

We can have a switch to enable/disable degraded mode.

@ti-chi-bot ti-chi-bot added the needs-rebase Indicates a PR cannot be merged because it has merge conflicts with HEAD. label Mar 2, 2023
@@ -297,6 +320,10 @@ func (c *ResourceGroupsController) sendTokenBucketRequests(ctx context.Context,
Requests: requests,
TargetRequestPeriodMs: uint64(defaultTargetPeriod / time.Millisecond),
}
if c.run.responseDeadline == nil {
c.run.responseDeadline = time.NewTimer(time.Second)
Member

I suggest using Stop() and Reset() here to re-use the timer as much as possible rather than creating a new one each time.
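A rough sketch of that suggestion: allocate the timer once, stop it immediately so it stays silent, and call `Reset` whenever a request goes out. The `responseDeadline` type and its methods are hypothetical names for illustration, not identifiers from this PR.

```go
package main

import (
	"fmt"
	"time"
)

// responseDeadline owns one timer that is reused across requests.
// The type and its methods are illustrative names only.
type responseDeadline struct {
	timer *time.Timer
}

// newResponseDeadline allocates the timer once and stops it right away,
// so nothing is delivered on timer.C until the first arm call.
func newResponseDeadline() *responseDeadline {
	t := time.NewTimer(time.Hour)
	t.Stop()
	return &responseDeadline{timer: t}
}

// arm restarts the existing timer instead of allocating a new one.
// It assumes the timer is stopped or its last tick was already received.
func (d *responseDeadline) arm(wait time.Duration) {
	d.timer.Reset(wait)
}

// firedWithin reports whether the timer delivers a tick within grace.
func (d *responseDeadline) firedWithin(grace time.Duration) bool {
	select {
	case <-d.timer.C:
		return true
	case <-time.After(grace):
		return false
	}
}

func main() {
	d := newResponseDeadline()
	fmt.Println("fires while stopped:", d.firedWithin(20*time.Millisecond))
	for i := 0; i < 2; i++ {
		d.arm(10 * time.Millisecond)
		fmt.Println("fires after arm:", d.firedWithin(time.Second))
	}
}
```

Reusing one timer avoids an allocation per request, at the cost of having to reason about whether a tick may still be pending when `Reset` is called.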

Signed-off-by: Cabinfever_B <cabinfeveroier@gmail.com>
Signed-off-by: Cabinfever_B <cabinfeveroier@gmail.com>
Signed-off-by: Cabinfever_B <cabinfeveroier@gmail.com>
Signed-off-by: Cabinfever_B <cabinfeveroier@gmail.com>

add degraded mode switch

Signed-off-by: Cabinfever_B <cabinfeveroier@gmail.com>
@ti-chi-bot ti-chi-bot removed the needs-rebase Indicates a PR cannot be merged because it has merge conflicts with HEAD. label Mar 3, 2023
)
}

// GenerateConfig generates the configuration by the given request unit configuration.
- func GenerateConfig(ruConfig *RequestUnitConfig) *Config {
+ func GenerateConfig(ruConfig *RequestUnitConfig, rmServerConfig *RMServerConfig) *Config {
Member

Can we merge the ruConfig and rmServerConfig into one?

@@ -210,6 +247,7 @@ func (c *ResourceGroupsController) Stop() error {
return errors.Errorf("resource groups controller does not start")
}
c.loopCancel()
c.run.responseDeadline.Stop()
Member

Is this necessary since there is a defer already before?

Signed-off-by: Cabinfever_B <cabinfeveroier@gmail.com>
@ti-chi-bot ti-chi-bot added the needs-rebase Indicates a PR cannot be merged because it has merge conflicts with HEAD. label Mar 7, 2023
Signed-off-by: Cabinfever_B <cabinfeveroier@gmail.com>
@ti-chi-bot ti-chi-bot removed the needs-rebase Indicates a PR cannot be merged because it has merge conflicts with HEAD. label Mar 7, 2023
@@ -112,14 +117,17 @@ func NewResourceGroupController(
requestUnitConfig *RequestUnitConfig,
Member

Do we still need this parameter?

Member Author

I can't decide if we need to use the specified config on the client side, so I keep it.

@@ -31,7 +32,7 @@ import (
)

const (
requestUnitConfigPath = "resource_group/ru_config"
controllerConfigPath = "resource_group/control"
Member

Maybe a little bit more specific would be better.

Suggested change
- controllerConfigPath = "resource_group/control"
+ controllerConfigPath = "resource_group/controller_config"

Member Author

Do we need the suffix? The other config key paths don't have "_config".


// RequestUnit is the configuration determines the coefficients of the RRU and WRU cost.
// This configuration should be modified carefully.
RequestUnit RequestUnitConfig
Member

Where are the toml and json tags?

type ControllerConfig struct {
// EnableDegradedMode is to control whether resource control client enable degraded mode when server is disconnect.
EnableDegradedMode bool `toml:"enable-degraded-mode" json:"enable-degraded-mode"`

// RequestUnit is the configuration determines the coefficients of the RRU and WRU cost.
// This configuration should be modified carefully.
RequestUnit RequestUnitConfig
Member

Ditto.
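To illustrate why the tags matter, here is a small sketch using only `encoding/json`. The tag value `request-unit` and the `ReadBaseCost` field are assumptions made for the example; the PR excerpt does not show which names were finally chosen.

```go
package main

import (
	"encoding/json"
	"fmt"
)

// RequestUnitConfig is reduced to a single hypothetical field for this sketch.
type RequestUnitConfig struct {
	ReadBaseCost float64 `toml:"read-base-cost" json:"read-base-cost"`
}

// ControllerConfig mirrors the struct under review, with tags added to the
// nested field. The "request-unit" key is an assumed name.
type ControllerConfig struct {
	EnableDegradedMode bool              `toml:"enable-degraded-mode" json:"enable-degraded-mode"`
	RequestUnit        RequestUnitConfig `toml:"request-unit" json:"request-unit"`
}

// encode returns the JSON form of cfg as a string.
func encode(cfg ControllerConfig) (string, error) {
	b, err := json.Marshal(cfg)
	if err != nil {
		return "", err
	}
	return string(b), nil
}

func main() {
	out, err := encode(ControllerConfig{
		EnableDegradedMode: true,
		RequestUnit:        RequestUnitConfig{ReadBaseCost: 0.25},
	})
	if err != nil {
		panic(err)
	}
	fmt.Println(out)
	// prints {"enable-degraded-mode":true,"request-unit":{"read-base-cost":0.25}}
}
```

Without a `json` tag on the nested field, `encoding/json` would emit the Go field name `RequestUnit` as the key, which would not match the kebab-case style of the other keys.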

Signed-off-by: Cabinfever_B <cabinfeveroier@gmail.com>
@@ -161,6 +169,9 @@ func (c *ResourceGroupsController) Start(ctx context.Context) {
c.initRunState()
c.loopCtx, c.loopCancel = context.WithCancel(ctx)
go func() {
c.run.responseDeadline = time.NewTimer(time.Second)
c.run.responseDeadline.Stop()
Contributor

remove this line?

Member Author

Ref #6063 (comment): we create the Timer here, but we don't want it to send on its channel at the beginning, so we stop it right away.

@@ -344,6 +381,10 @@ func (c *ResourceGroupsController) sendTokenBucketRequests(ctx context.Context,
Requests: requests,
TargetRequestPeriodMs: uint64(defaultTargetPeriod / time.Millisecond),
}
if c.responseDeadlineCh == nil {
c.run.responseDeadline.Reset(time.Second)
Contributor

How about considering making the deadline timeout be configurable?

Member Author

Good point

if c.responseDeadlineCh != nil {
if c.run.responseDeadline.Stop() {
select {
case <-c.run.responseDeadline.C:
Contributor

Why do we need this line?

Member Author

As the doc comment of Timer.Stop() says:

Stop does not close the channel, to prevent a read from the channel succeeding incorrectly. To ensure the channel is empty after a call to Stop, check the return value and drain the channel.

@ti-chi-bot ti-chi-bot added the needs-rebase Indicates a PR cannot be merged because it has merge conflicts with HEAD. label Mar 8, 2023
Signed-off-by: Cabinfever_B <cabinfeveroier@gmail.com>
Contributor

@nolouch nolouch left a comment

rest lgtm

@@ -195,6 +212,8 @@ func (c *ResourceGroupsController) Start(ctx context.Context) {
c.updateAvgRequestResourcePerSec()
if !c.run.requestInProgress {
c.collectTokenBucketRequests(c.loopCtx, "low_ru", true /* only select low tokens resource group */)
} else if c.run.inDegradedMode {
Contributor

I wonder whether taking this check out of the else branch would avoid some unexpected behavior here:

if c.run.inDegradedMode


func (gc *groupCostController) applyBasicConfigForRawResourceTokenCounter() {
for typ, counter := range gc.run.resourceTokens {
if !counter.limiter.IsLowTokens() {
Contributor

Maybe we should not skip it, to make sure it enters degraded mode, or add a log here.

Member Author

I want low-token buckets to enter degraded mode. It will log when a bucket successfully enters degraded mode.
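The behavior discussed in this thread, applying the configured fill rate only to counters that are low on tokens, might look roughly like the sketch below. `tokenCounter`, its fields, and `applyConfiguredFillRate` are hypothetical stand-ins; the PR's real types (such as the limiter behind `IsLowTokens()`) are not reproduced here.

```go
package main

import (
	"fmt"
	"sort"
)

// tokenCounter is a simplified stand-in for a per-resource token bucket.
type tokenCounter struct {
	lowTokens bool
	fillRate  float64
}

// applyConfiguredFillRate sets the fill rate of every low-token counter to
// the configured value, leaves the others untouched, and returns the sorted
// names of the counters it changed so the caller can log them.
func applyConfiguredFillRate(counters map[string]*tokenCounter, configured float64) []string {
	var changed []string
	for name, c := range counters {
		if !c.lowTokens {
			continue // still has tokens: keep its current settings
		}
		c.fillRate = configured
		changed = append(changed, name)
	}
	sort.Strings(changed)
	return changed
}

func main() {
	counters := map[string]*tokenCounter{
		"rru": {lowTokens: true, fillRate: 10},
		"wru": {lowTokens: false, fillRate: 20},
	}
	changed := applyConfiguredFillRate(counters, 500)
	fmt.Println("entered degraded mode:", changed)
	fmt.Println("rru:", counters["rru"].fillRate, "wru:", counters["wru"].fillRate)
}
```

Returning the changed names gives the caller one place to log, which covers the reviewer's request for visibility without touching buckets that still hold tokens.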

Signed-off-by: Cabinfever_B <cabinfeveroier@gmail.com>
@ti-chi-bot ti-chi-bot removed the needs-rebase Indicates a PR cannot be merged because it has merge conflicts with HEAD. label Mar 10, 2023
Comment on lines 127 to 129
if err != nil {
return nil, err
}
Member

It looks duplicated

Signed-off-by: Cabinfever_B <cabinfeveroier@gmail.com>
Signed-off-by: Cabinfever_B <cabinfeveroier@gmail.com>
Contributor

@nolouch nolouch left a comment

lgtm

@@ -61,8 +61,28 @@ const (
defaultWriteCostPerByte = 1. / 1024
// 1 RU = 3 millisecond CPU time
defaultCPUMsCost = 1. / 3

defaultDegradedModeWaitDuration = "1s"
Contributor

We can turn this off by default first, then turn it on after more testing.
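One way to combine a configurable wait duration with an off-by-default setting is to treat an empty or non-positive duration as "disabled". That convention is an assumption of this sketch, not something the excerpt confirms, and `parseDegradedModeWait` is a hypothetical helper.

```go
package main

import (
	"fmt"
	"time"
)

// parseDegradedModeWait converts a duration string such as "1s" into a
// time.Duration. Assumption for this sketch: an empty string or a
// non-positive duration means degraded mode is disabled.
func parseDegradedModeWait(raw string) (wait time.Duration, enabled bool, err error) {
	if raw == "" {
		return 0, false, nil
	}
	wait, err = time.ParseDuration(raw)
	if err != nil {
		return 0, false, fmt.Errorf("invalid degraded mode wait duration %q: %w", raw, err)
	}
	return wait, wait > 0, nil
}

func main() {
	for _, raw := range []string{"1s", "0s", "", "soon"} {
		wait, enabled, err := parseDegradedModeWait(raw)
		fmt.Printf("raw=%q wait=%v enabled=%v err=%v\n", raw, wait, enabled, err)
	}
}
```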

@ti-chi-bot ti-chi-bot added the status/LGT1 Indicates that a PR has LGTM 1. label Mar 13, 2023
Signed-off-by: Cabinfever_B <cabinfeveroier@gmail.com>

address comment

Signed-off-by: Cabinfever_B <cabinfeveroier@gmail.com>
@CabinfeverB CabinfeverB force-pushed the resource-manager/degraded_mode branch from cb4c4c8 to e453550 Compare March 13, 2023 07:40
Signed-off-by: Cabinfever_B <cabinfeveroier@gmail.com>
Signed-off-by: Cabinfever_B <cabinfeveroier@gmail.com>
@nolouch
Contributor

nolouch commented Mar 15, 2023

ptal @JmPotato

Signed-off-by: Cabinfever_B <cabinfeveroier@gmail.com>
Member

@HuSharp HuSharp left a comment

rest LGTM

@@ -45,7 +45,7 @@ const (
// resource group storage endpoint has prefix `resource_group`
resourceGroupSettingsPath = "settings"
resourceGroupStatesPath = "states"
- requestUnitConfigPath = "ru_config"
+ requestUnitConfigPath = "controller"
Member

Do we also need to change this requestUnitConfigPath name?

@@ -61,7 +61,7 @@ func (se *StorageEndpoint) LoadResourceGroupStates(f func(k, v string)) error {
return se.loadRangeByPrefix(resourceGroupStatesPath+"/", f)
}

- // SaveRequestUnitConfig stores the request unit config to storage.
- func (se *StorageEndpoint) SaveRequestUnitConfig(config interface{}) error {
+ // SaveControllerConfig stores the request unit config to storage.
Member

ditto for comment.

@ti-chi-bot
Member

@HuSharp: Thanks for your review. The bot only counts approvals from reviewers and higher roles in list, but you're still welcome to leave your comments.

In response to this:

rest LGTM

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the ti-community-infra/tichi repository.

Signed-off-by: Cabinfever_B <cabinfeveroier@gmail.com>
Signed-off-by: Cabinfever_B <cabinfeveroier@gmail.com>
Signed-off-by: Cabinfever_B <cabinfeveroier@gmail.com>
Contributor

@nolouch nolouch left a comment

lgtm

@ti-chi-bot ti-chi-bot added status/LGT2 Indicates that a PR has LGTM 2. and removed status/LGT1 Indicates that a PR has LGTM 1. labels Mar 16, 2023
@JmPotato
Member

/merge

@ti-chi-bot
Member

@JmPotato: It seems you want to merge this PR, I will help you trigger all the tests:

/run-all-tests

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the ti-community-infra/tichi repository.

@ti-chi-bot
Member

This pull request has been accepted and is ready to merge.

Commit hash: fa69657

@ti-chi-bot ti-chi-bot added the status/can-merge Indicates a PR has been approved by a committer. label Mar 16, 2023
@ti-chi-bot ti-chi-bot merged commit 7a0ce10 into tikv:master Mar 16, 2023
Labels
release-note-none status/can-merge Indicates a PR has been approved by a committer. status/LGT2 Indicates that a PR has LGTM 2.
Projects
None yet
Development

Successfully merging this pull request may close these issues.

None yet

6 participants