Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

7705 retryable http client #7723

Merged
merged 10 commits into from
May 6, 2024
Merged

7705 retryable http client #7723

merged 10 commits into from
May 6, 2024

Conversation

eladlachmi
Copy link
Contributor

Closes #7705

Change Description

Background

See the associated issue for context.

New Feature

This PR creates a new HTTP client for lakectl, based on go-retryablehttp. It supports retries with a custom retry policy tailored for lakeFS and an exponential backoff retry strategy. Its constructor func includes several parameters to tune the retry max count and intervals and the option to provide a custom http.Transport. For seamless integration, the constructor returns a standard *http.Client.

Testing Details

This PR includes a suite of tests for the custom retry policy. The retry mechanism was tested using the abuse command with a playground installation, a modified DDB table, and a small number of RCUs/WCUs.

The abuse command:

./lakectl abuse random-write  lakefs://sample-repo/main --parallelism 100 --prefix abuse-3/

Without retries, this command fails with HTTP 500 errors. With retires it retries the failures and completes successfully.

Thanks @itaiad200 for the abuse command ✌🏻

@eladlachmi eladlachmi added area/lakectl Issues related to lakeFS' command line interface (lakectl) include-changelog PR description should be included in next release changelog labels May 5, 2024
@eladlachmi eladlachmi self-assigned this May 5, 2024
Copy link

github-actions bot commented May 5, 2024

🎊 PR Preview 376e69b has been successfully built and deployed to https://treeverse-lakeFS-preview-pr-7723.surge.sh

🕐 Build time: 0.014s

🤖 By surge-preview

Copy link

github-actions bot commented May 5, 2024

E2E Test Results - DynamoDB Local - Local Block Adapter

13 passed

Copy link
Contributor

@guy-har guy-har left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Approving due to urgency, there are no blocking comments, but please do consider them.

@@ -87,7 +87,7 @@ lakectl [flags]
{:.no_toc}

```
--base-uri string base URI used for lakeFS address parse
--base-uri string base URI used for lakeFS address parse (default "https://treeverse-proper-boar-yjz5us.us-east-1.lakefscloud.ninja/api/v1")
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

What does this have to do lakeFS?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Oops... that wasn't supposed to be here...

@@ -521,6 +538,10 @@ func initConfig() {
// set defaults
viper.SetDefault("metastore.hive.db_location_uri", "file:/user/hive/warehouse/")
viper.SetDefault("server.endpoint_url", "http://127.0.0.1:8000")
viper.SetDefault("server.retries.enabled", true)
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Isn't the number of retries enough to decide if we have retries or not?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It is, but I'd prefer to be explicit if it doesn't cost me anything.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Doesn't cost you, but does cost the user understanding
What happens if I don't want reties and I change retries to 0 because that's what I'm used to, what will happen here?

Comment on lines 443 to 449
if !cfg.Server.Retries.Enabled {
httpClient = &http.Client{
Transport: transport,
}
} else {
httpClient = NewRetryClient(cfg.Server.Retries.MaxRetries,
cfg.Server.Retries.MinWaitInterval, cfg.Server.Retries.MaxWaitInterval, transport)
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

nit: but I would expect the newRetryClient to get the httpClient and not the transport, that we can initiate the client regardless and then just wrap it with the retry client if needed

retryClient.RetryMax = maxRetries
retryClient.RetryWaitMin = retryWaitMin
retryClient.RetryWaitMax = retryWaitMax
retryClient.CheckRetry = lakectlRetryPolicy
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I still believe the default is good enough for us

Comment on lines 21 to 25
func NewRetryClient(maxRetries int, retryWaitMin, retryWaitMax time.Duration, transport *http.Transport) *http.Client {
retryClient := retryablehttp.NewClient()
if transport != nil {
retryClient.HTTPClient.Transport = transport
}
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I still believe that a simple function in cmd/root.go will do. Also cmd is not the correct directory for this, I would expect it to be somewhere more generic, such as pkg/httputil

retryClient.HTTPClient.Transport = transport
}
retryClient.Logger = nil
retryClient.RetryMax = maxRetries
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

aws cli, uses maxAttemps IMO it's clear, In max retries I never remember if the first try counts

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'm fine either way, but done.

Comment on lines 67 to 68
MinWaitInterval time.Duration `mapstructure:"min_wait_interval"`
MaxWaitInterval time.Duration `mapstructure:"max_wait_interval"`
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Can you please document this two

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Added inline comments. If you can point me to the documentation, I'll add it there.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I guess a section here mentioning retries would be appropriate..

Copy link
Contributor

@itaiad200 itaiad200 left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Looks great! request-changes-worthy-comments:

  1. Lack of docs for the new values
  2. I think we should start with something that is less configureable.

docs/reference/cli.md Outdated Show resolved Hide resolved
Comment on lines 65 to 68
Enabled bool `mapstructure:"enabled"`
MaxRetries int `mapstructure:"max_retries"`
MinWaitInterval time.Duration `mapstructure:"min_wait_interval"`
MaxWaitInterval time.Duration `mapstructure:"max_wait_interval"`
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

These new values should go to the cli reference doc page.
I also suggest to start leaner, maybe just Enabled & MaxRetries to kick it off? Maybe even just Enabled?
I believe the defaults you set would be good enough for almost everyone. I don't feel like we have enough information to determine users need to configure intervals for now.. We can always add more custom config later, removing it is the hard part..

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I don't see the point of providing fewer configuration options. We lose the flexibility to configure for optimal results, and this way, we'll never get feedback on which settings work for different use cases.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Because it's adding scope and friction. Instead of having "retry mechanism" for lakectl, we have "configurable retry mechanism". The gap isn't big coding-wise, I agree. But now I have to ask annoying questions regarding usage, like MaxRetries non-positive, MaxWaitInterval smaller than MinWaitInterval, why choose these values and not TotalWaitTime, etc. Without someone requesting something specific we're guessing what they need here, and then changing it might be a breaking change in the cli configuration.

cmd/lakectl/cmd/retry_client.go Outdated Show resolved Hide resolved
@@ -140,6 +147,10 @@ const (
allowEmptyMsgFlagName = "allow-empty-message"
fmtErrEmptyMsg = `commit with no message without specifying the "--allow-empty-message" flag`
metaFlagName = "meta"

defaultMaxRetries = 3
defaultMaxRetryInterval = 1 * time.Second
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Too low IMO

Suggested change
defaultMaxRetryInterval = 1 * time.Second
defaultMaxRetryInterval = 5 * time.Second

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I used a different value, but raised the max interval.

cmd/lakectl/cmd/retry_client.go Show resolved Hide resolved
cmd/lakectl/cmd/retry_client.go Show resolved Hide resolved
Copy link
Contributor

@arielshaqed arielshaqed left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

No comments other than not understanding the motivation for the chosen default values.

Comment on lines 151 to 153
defaultMaxRetries = 3
defaultMaxRetryInterval = 1 * time.Second
defaultMinRetryInterval = 200 * time.Millisecond
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I don't understand these numbers. Suppose the HTTP server never sends a "retry after" header (I don't think lakeFS does). Then we have 3 retries, starting with 0.2 s and going up to 1 second. But 3 retries will sleep for 0.2s and then for 0.4s. When does this max retry interval ever come into play?

If we had 6 retries, then IIUC our wait sequence would be (in seconds) 0.2, 0.4, 0.8, 1.0, 1.0, 1.0.

For example, AFAICT this is the default configuration.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

If we had 6 retries, then IIUC our wait sequence would be (in seconds) 0.2, 0.4, 0.8, 1.0, 1.0, 1.0

Yes, that's accurate. I'm not sure why this is a problem, but I don't mind using the package defaults instead.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Done

Copy link
Contributor

@arielshaqed arielshaqed left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks!

Copy link
Contributor

@itaiad200 itaiad200 left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM
I still prefer that we start with leaner configuration but won't block on it..

@eladlachmi
Copy link
Contributor Author

I want to take this opportunity to say that @arielshaqed was right! I should unwrap the error 😊

@eladlachmi eladlachmi merged commit 3821ff0 into master May 6, 2024
35 checks passed
@eladlachmi eladlachmi deleted the 7705-retryable-http-client branch May 6, 2024 12:44
emulatorchen pushed a commit to emulatorchen/lakeFS that referenced this pull request May 27, 2024
* lakectl - Support Retries
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
area/lakectl Issues related to lakeFS' command line interface (lakectl) include-changelog PR description should be included in next release changelog
Projects
None yet
Development

Successfully merging this pull request may close these issues.

lakectl - retry mechanism
4 participants