
storagenode: decline uploads when there are too many live requests #2397

Merged
merged 8 commits into master from ee/sno-backoff Jul 3, 2019

Conversation

egonelbre
Member

@egonelbre egonelbre commented Jul 1, 2019

Currently it's easy to overload storage nodes and crash them. This PR adds a maximum concurrent requests parameter which throttles upload requests.
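The throttling this PR describes can be sketched roughly as follows. This is a minimal illustration, not the actual PR code: `Endpoint`, `Upload`, and the field names are simplified stand-ins for the real piecestore types.

```go
package main

import (
	"errors"
	"fmt"
	"sync/atomic"
)

// ErrTooManyRequests is returned when the node is at capacity.
var ErrTooManyRequests = errors.New("storage node overloaded: too many concurrent requests")

// Endpoint sketches the throttling state (illustrative names only).
type Endpoint struct {
	// liveRequests comes first in the struct as a precaution for
	// atomic access alignment on 32-bit platforms.
	liveRequests          int32
	maxConcurrentRequests int32
}

// Upload declines work when the live-request count exceeds the limit.
func (e *Endpoint) Upload(data []byte) error {
	live := atomic.AddInt32(&e.liveRequests, 1)
	defer atomic.AddInt32(&e.liveRequests, -1)

	if live > e.maxConcurrentRequests {
		return ErrTooManyRequests
	}
	// ... store the piece ...
	return nil
}

func main() {
	e := &Endpoint{maxConcurrentRequests: 2}
	fmt.Println(e.Upload(nil)) // under the limit: <nil>
}
```

The counter is incremented before the limit check and always decremented on return, so a declined upload never leaks a slot.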

Please describe the tests:

  • Test 1:
  • Test 2:

Please describe the performance impact:

Code Review Checklist (to be filled out by reviewer)

  • Does the PR describe what changes are being made?
  • Does the PR describe why the changes are being made?
  • Does the code follow our style guide?
  • Does the code follow our testing guide?
  • Is the PR appropriately sized? (If it could be broken into smaller PRs it should be)
  • Does the new code have enough tests? (every PR should have tests or justification otherwise. Bug-fix PRs especially)
  • Does the new code have enough documentation that answers "how do I use it?" and "what does it do?"? (both source documentation and higher level, diagrams?)
  • Does any documentation need updating?
  • Do the database access patterns make sense?

@egonelbre egonelbre requested a review from aleitner as a code owner July 1, 2019 07:42
@cla-bot cla-bot bot added the cla-signed label Jul 1, 2019
@egonelbre egonelbre added the Request Code Review Code review requested label Jul 1, 2019
@@ -55,6 +56,7 @@ type OldConfig struct {
// Config defines parameters for piecestore endpoint.
type Config struct {
ExpirationGracePeriod time.Duration `help:"how soon before expiration date should things be considered expired" default:"48h0m0s"`
MaxConcurrentRequests int `help:"how many concurrent requests are allowed, before uploads are rejected." default:"30"`
Member

How would SNOs know what value to configure here?

Is there a way for the storage node to self-diagnose and determine if it is overloaded or not?

Member Author

Probably with some kind of benchmarking by some other party. The issue is that it's not just about the storage node itself, but also about the network bandwidth it's able to deliver.

I guess we could try monitoring bandwidth usage (and back off when it stops increasing), or load, or memory usage, etc. But these are all much more complicated solutions than a plain "this is how many requests this storage node can serve."

Contributor

I think it’s totally fine to have a default; if someone notices a lot of issues, they can just tune it down. We had the same mechanic in V2 and it worked fine.
As for the default itself, I would recommend something in the neighborhood of 5-10 max.

Member Author

Yeah, I'm not sure what to put as limits for now. Is there an easy way to test how much we can handle?

Contributor

My server node was handling 15-20 requests per second max when fully cached.
The Pi 3B+ is already overwhelmed by 3 requests.

Contributor

I could run a few test uploads on my local network to see what I can come up with.

Contributor

I wonder if the storage nodes could potentially obtain data from one or more satellites to help them decide how many requests they can handle, and adjust the max accordingly.

Contributor

Dynamic scaling would just add another layer of potential issues, @phutchins.
Let's settle on a good default/average and update the docs accordingly.

Member Author

Let me know when you get a number from testing @stefanbenten

Contributor

I will test that after lunch 👍

@egonelbre egonelbre changed the title storagenode: decline upload requests when there are too many live requests storagenode: decline uploads when there are too many live requests Jul 1, 2019
@@ -74,6 +76,8 @@ type Endpoint struct {
orders orders.DB
usage bandwidth.DB
usedSerials UsedSerials

liveRequests int32
Member

We should make this one of the first struct fields, so alignment for atomic access is better guaranteed on ARM (https://golang.org/pkg/sync/atomic/#pkg-note-BUG).

Member

Wasn't this an issue for int64 only?

Member

Oh yeah, good point. Maybe worth testing on an ARM device anyway.

@stefanbenten stefanbenten added the Sprint Release Goal Sprint Release Goal label Jul 2, 2019
@@ -109,6 +109,8 @@ func (planet *Planet) newStorageNodes(count int, whitelistedSatelliteIDs []strin
StaticDir: filepath.Join(developmentRoot, "web/operator/"),
},
Storage2: piecestore.Config{
ExpirationGracePeriod: 0,
MaxConcurrentRequests: 100,
Member Author

Leaving testplanet at 100, because setting this lower would probably break some tests.

Contributor

Worth testing where the limit is, in another PR of course.

Contributor

@stefanbenten stefanbenten left a comment

the rest looks good to me!

@@ -55,6 +56,7 @@ type OldConfig struct {
// Config defines parameters for piecestore endpoint.
type Config struct {
ExpirationGracePeriod time.Duration `help:"how soon before expiration date should things be considered expired" default:"48h0m0s"`
MaxConcurrentRequests int `help:"how many concurrent requests are allowed, before uploads are rejected." default:"6"`
Contributor

Maybe we want to define a devDefault and a releaseDefault, to prevent storj-sim slowdowns?
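That suggestion would look roughly like this on the config field. This is a sketch only; it assumes the `devDefault`/`releaseDefault` tag names follow the convention used elsewhere in the repo, and the values shown are illustrative:

```go
// Hypothetical: a generous dev default for storj-sim, a conservative release default.
MaxConcurrentRequests int `help:"how many concurrent requests are allowed, before uploads are rejected." devDefault:"100" releaseDefault:"6"`
```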

@egonelbre egonelbre added the Reviewer Can Merge If all checks have passed, non-owner can merge PR label Jul 2, 2019
@egonelbre egonelbre merged commit 38f3d86 into master Jul 3, 2019
@egonelbre egonelbre deleted the ee/sno-backoff branch July 3, 2019 13:47
littleskunk pushed a commit that referenced this pull request Jul 5, 2019
@storjrobot

This pull request has been mentioned on Storj Labs Community Forum: Decentralized Cloud Storage. There might be relevant details there:

https://forum.storj.io/t/changelog-v0-14-11/433/1
