Identify what subsystem is causing the biggest difference between the 99th and 50th percentile for upload performance between the public and select networks #6880

Closed
iglesiasbrandon opened this issue Mar 26, 2024 · 2 comments

Comments

@iglesiasbrandon
Contributor

iglesiasbrandon commented Mar 26, 2024

Background:

Slack channel: #strikeforce-upload-performance-variation

This issue is about following up on two of our OKRs:
  • Get Veeam Backup performance on par with Wasabi
  • Pass Veeam Ready

The edge team has been monitoring our speed for some time and at this point believes we're usually "fast enough". The remaining problem is our 99th percentile performance. According to the edge team, upload performance (and possibly download, though we're mostly focused on upload) occasionally slows enough that the variation tanks our overall score. The edge team feels this variation is within the uplink envelope, and so has passed the torch to us to figure out what's next.

So, this strikeforce's goals:

  1. identify which subsystems are causing the most difference between our 50th and 99th percentile performance
  2. fix those subsystems

We suspect the problem lies largely in the storage node network. The public network and the select network face different issues, so we want to investigate both.

There are a bunch of things we could do, but perhaps the best starting point for goal 1 is to:

  • run a bunch of Veeam Ready-sized uploads
  • take a trace of each upload
  • for each subsystem, record the distribution of time spent, as extracted from the traces
  • find the subsystems with the biggest difference between their 50th and 99th percentile
  • repeat recursively
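
The ranking step in the list above could look roughly like the sketch below. It is only an illustration: it assumes the traces have already been flattened into `(subsystem, duration_seconds)` pairs, and the function names, the subsystem names in the sample data, and the choice of nearest-rank percentiles and an absolute p99 minus p50 gap are all assumptions, not anything taken from our tracing tooling. A ratio (p99 / p50) might be a better score if subsystems differ a lot in baseline duration.

```python
from collections import defaultdict


def percentile(sorted_values, p):
    """Nearest-rank percentile of a sorted, non-empty list; p is an integer in 1..100."""
    n = len(sorted_values)
    rank = (p * n + 99) // 100  # integer ceil(p * n / 100)
    return sorted_values[max(rank, 1) - 1]


def rank_subsystems_by_tail_gap(spans):
    """Group span durations by subsystem and rank by (p99 - p50), largest first.

    spans: iterable of (subsystem_name, duration_seconds) pairs (hypothetical format).
    Returns a list of (subsystem, p50, p99, gap) tuples.
    """
    by_subsystem = defaultdict(list)
    for name, duration in spans:
        by_subsystem[name].append(duration)

    rows = []
    for name, durations in by_subsystem.items():
        durations.sort()
        p50 = percentile(durations, 50)
        p99 = percentile(durations, 99)
        rows.append((name, p50, p99, p99 - p50))

    rows.sort(key=lambda row: row[3], reverse=True)
    return rows


if __name__ == "__main__":
    # Made-up sample data; real input would come from the upload traces.
    sample = [("piece_upload", 0.10)] * 98 + [("piece_upload", 2.50)] * 2
    sample += [("metainfo_commit", 0.05)] * 100
    for name, p50, p99, gap in rank_subsystems_by_tail_gap(sample):
        print(f"{name}: p50={p50:.2f}s p99={p99:.2f}s gap={gap:.2f}s")
```

To "repeat recursively", the same function could be run again on only the child spans of the top-ranked subsystem.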
iglesiasbrandon changed the title from "Identify what subsystem is causing the biggest difference between the 99th and 50th percentile on the public and select networks" to "Identify what subsystem is causing the biggest difference between the 99th and 50th percentile for upload performance between the public and select networks" Mar 26, 2024
profclems removed their assignment Apr 23, 2024
@profclems
Member

I will unassign myself from this issue, since the effort has largely been on the team-delivery side.

@storjrobot

This issue has been mentioned on Storj Community Forum (official). There might be relevant details there:

https://forum.storj.io/t/updates-on-test-data/26034/40

Labels
None yet
Projects
Status: Done/Deployed

4 participants