Add await metrics to "OS Metrics" dashboard #1343
Comments
Since Scylla uses a user-space I/O scheduler and will not submit more items to the disk than the disk can "chew", these values are expected to be 0 or close to 0.
The cases in which these values can be higher than 0 are:
- something else (aside from Scylla) is using the disk (a backup upload is one example - in such cases we should tune the system accordingly to ensure that manager-agent is not using too much bandwidth and that Scylla allows enough bandwidth for the backup upload)
- or we have an issue with iotune / disk settings (e.g. caching on GCP causing this).
Amnon, this can also be an advisor input - e.g. a value higher than 0.5 can be considered an indication that the system is not set up / tuned correctly (see the sketch below).
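A minimal sketch of what such an advisor rule could look like, assuming node_exporter metric names and assuming the 0.5 above refers to milliseconds; the group name, alert name, and threshold are illustrative, not part of any existing dashboard:

```yaml
# Hypothetical Prometheus alerting rule - names and the 0.5ms threshold
# are assumptions for illustration only.
groups:
  - name: disk_await_advisor
    rules:
      - alert: DiskReadAwaitHigh
        # r_await in seconds: time spent on reads / reads completed
        expr: |
          rate(node_disk_read_time_seconds_total[5m])
            / rate(node_disk_reads_completed_total[5m]) > 0.0005
        for: 10m
        annotations:
          summary: "Disk read await above 0.5ms - check iotune/disk settings"
```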
Right. Not zero, but it should be reasonably low. Note that it would also depend on the size of the I/O operations taking place in a given time frame.
Good point. So we should probably indicate when actual await times are more than 10x that value. Maybe even before...
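As a rough sketch of that "more than 10x" indication, using the seastar default latency goal of 0.75ms (mentioned below) as the baseline; the baseline choice is an assumption, not something agreed in this thread:

```
# Hypothetical: flag devices whose measured r_await exceeds 10x a baseline;
# 0.00075 (0.75ms, the default latency goal) is an assumed stand-in baseline.
(
  rate(node_disk_read_time_seconds_total[5m])
    / rate(node_disk_reads_completed_total[5m])
) > 10 * 0.00075
```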
Depends on what "can chew" means. Note that seastar's goal is to send to the disk as much data as it can complete within a given time. Strictly speaking, that's not the same as "as much data as the disk can process without delays". From what I see on both AWS and GCE instances, pure writes are always ~2x "overdispatched", in the sense that the disk cannot process everything right away and queues some data, but it still manages to complete everything within the latency goal.
The default latency goal is 0.75ms.
Absolutely.
Not always. For reads -- yes. For writes, I saw on an AWS 4-disk RAID that issuing 64k write requests one at a time showed peak throughput.
@vladzcloudius @slivne "Close to zero" is not something I can work with. I understand it needs to be very low; can we agree on some value?
@amnonh I'm not sure what you are asking about here. Could you please clarify?
@vladzcloudius There are two issues here:
@amnonh
The "normal" value depends on the actual HW used for I/O. @xemul, can you comment? @amnonh, in any case the value needs to be configurable: e.g. there are going to be very different values for NVMe, SSD and HDD (see the per-device sketch below).
System information
Describe the feature and the current behavior/state.
"OS Metrics" dashboard is missing a very important graph: r-wait/w-wait disk graphs.
These are crucial when debugging latencies related issues.
Who will benefit with this feature?
Everybody
Any other info.
A Prometheus query like the following may be used for r_await (based on the node_exporter metrics described in the link below):
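```
# Average read await (r_await) in seconds per read, per device:
rate(node_disk_read_time_seconds_total[5m])
  / rate(node_disk_reads_completed_total[5m])

# w_await, analogously:
rate(node_disk_write_time_seconds_total[5m])
  / rate(node_disk_writes_completed_total[5m])
```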
More info may be found here: https://www.robustperception.io/mapping-iostat-to-the-node-exporters-node_disk_-metrics