
Add await metrics to "OS Metrics" dashboard #1343

Closed
vladzcloudius opened this issue Mar 26, 2021 · 10 comments · Fixed by #1409

Comments

vladzcloudius (Contributor) commented Mar 26, 2021

System information

  • Scylla version (you are using): All of them
  • Are you willing to contribute it (Yes/No): I'll attach queries below

Describe the feature and the current behavior/state.
"OS Metrics" dashboard is missing a very important graph: r-wait/w-wait disk graphs.
These are crucial when debugging latencies related issues.

Who will benefit with this feature?
Everybody

Any Other info.
Here is a Prometheus query that may be used for r-wait:

irate(node_disk_read_time_seconds_total[30s]) / irate(node_disk_reads_completed_total[30s])

More info may be found here: https://www.robustperception.io/mapping-iostat-to-the-node-exporters-node_disk_-metrics
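
For completeness, here is a sketch of the matching w-wait expression, assuming the standard node_exporter metric names (node_disk_write_time_seconds_total, node_disk_writes_completed_total):

irate(node_disk_write_time_seconds_total[30s]) / irate(node_disk_writes_completed_total[30s])

Both expressions divide the time spent on I/O by the number of completed operations over the same window, which is how iostat derives its await columns.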

vladzcloudius (Contributor Author) commented:

@slivne @amnonh FYI

slivne commented Apr 4, 2021 via email

vladzcloudius (Contributor Author) commented Apr 5, 2021

> Since scylla is using a user space i/o scheduler and will not submit items to the disk more than the disk can "chew", these values are expected to be 0 or close to 0.

Right. Not zero, but it should be reasonably low. Note that it will also depend on the size of the I/O operations taking place in a given time frame.

> The cases in which these values can be higher than 0 are:
> - something else (aside from scylla) is using the disk (backup upload is an example as well - in such cases we should tune the system accordingly to ensure that manager-agent is not using too much bandwidth and that scylla allows enough bandwidth for the backup upload),
> - or we have an issue with iotune / disk settings (cache in GCP causing this).
> Amnon, this can also be an advisor input - e.g. a value higher than 0.5 can be considered an indication that the system is not set up/tuned correctly.

Good point.
0.5 second is definitely a very high await value.
If memory serves me well, iotune targets a specific max I/O latency.
@xemul Could you please tell me what it is exactly? (Let's call it X for now.)

So, we should probably indicate when actual await times are higher than 10x that value. Maybe even before...

xemul commented Apr 6, 2021

> Since scylla is using a user space i/o scheduler and will not submit items to the disk more than the disk can "chew", these values are expected to be 0 or close to 0.

Depends on what "can chew" means. Note that seastar's goal is to send to the disk as much data as it can complete within a given time. Strictly speaking, that is not the same as "as much data as the disk can process without delays". From what I see on both AWS and GCE instances, pure writes are always ~2x "overdispatched", in the sense that the disk cannot process everything right away and queues some data, but still manages to complete everything within the latency goal.

> 0.5 second is definitely a very high await value.
> If memory serves me well, iotune targets a specific max I/O latency.
> @xemul Could you please tell me what it is exactly? (Let's call it X for now.)

Default latency goal is 0.75ms.
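
To relate measured await to this goal, one could divide the r-wait expression above by the goal, giving await in units of the latency goal (using 0.00075 s here is an assumption based on the default quoted in this comment):

(irate(node_disk_read_time_seconds_total[30s]) / irate(node_disk_reads_completed_total[30s])) / 0.00075

A value around 1 means the disk completes requests roughly within the goal; the 10x factor suggested earlier would correspond to this ratio exceeding 10.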

vladzcloudius (Contributor Author) commented:

> Since scylla is using a user space i/o scheduler and will not submit items to the disk more than the disk can "chew", these values are expected to be 0 or close to 0.

> Depends on what "can chew" means. Note that seastar's goal is to send to the disk as much data as it can complete within a given time. Strictly speaking, that is not the same as "as much data as the disk can process without delays". From what I see on both AWS and GCE instances, pure writes are always ~2x "overdispatched", in the sense that the disk cannot process everything right away and queues some data, but still manages to complete everything within the latency goal.

Absolutely.
@slivne note that there MUST be some concurrency if we want to get close to the optimum throughput. Hence the added latency (and therefore the added await time) will be non-zero as well. This is on top of the latency of handling a single request, which will also be part of the resulting await time.

xemul commented Apr 7, 2021

> note that there MUST be some concurrency if we want to get close to the optimum throughput

Not always. For reads, yes. For writes, on an AWS 4-disk RAID I saw that issuing 64k write requests one at a time already showed peak throughput.

amnonh added this to the monitoring 3.8 milestone on May 6, 2021
amnonh (Collaborator) commented May 18, 2021

@vladzcloudius @slivne "Close to zero" is not something I can work with. I understand it needs to be very low; can we agree on some value?
On my laptop (with an NVMe disk) I get the following:
[screenshot omitted]

vladzcloudius (Contributor Author) commented:

> @vladzcloudius @slivne "Close to zero" is not something I can work with. I understand it needs to be very low; can we agree on some value?

@amnonh I'm not sure what you are asking about here. Could you please clarify?

amnonh (Collaborator) commented May 18, 2021

@vladzcloudius There are two issues here:

  1. The comment for the graph: right now it just says "average time of read/write". Do we want to say something more? Stating that it should be low is not very meaningful for the user - in my example, is that OK? How can the user know?
  2. For an advisor alert I need a hard-coded value, so I need a number we can agree upon: if the average read or write time is higher than it, an alert will be generated suggesting to check iotune.

vladzcloudius (Contributor Author) commented:

> @vladzcloudius There are two issues here:
>
> 1. The comment for the graph: right now it just says "average time of read/write". Do we want to say something more? Stating that it should be low is not very meaningful for the user - in my example, is that OK? How can the user know?

@amnonh
I believe what you have right now is good enough. No need to add anything.

> 2. For an advisor alert I need a hard-coded value, so I need a number we can agree upon: if the average read or write time is higher than it, an alert will be generated suggesting to check iotune.

The "normal" value depends on the actual HW used for I/O.
"Agreeing" on a fixed value here is not going to cut it.

@xemul Can iotune generate an expected await time?

@amnonh In any case the value needs to be configurable: e.g. there are going to be very different values for NVMe, SSD and HDD.
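
To make the advisor idea above concrete, here is a minimal sketch of a Prometheus alerting rule. The group and alert names are hypothetical, and the threshold is a placeholder (10x the 0.75 ms default latency goal mentioned earlier) that would need to be made configurable per disk type (NVMe/SSD/HDD), as discussed:

groups:
  - name: os_metrics_advisor        # hypothetical group name
    rules:
      - alert: HighDiskReadAwait    # hypothetical alert name
        # 0.0075 s = 10x the 0.75 ms default latency goal; a placeholder, not an agreed value
        expr: irate(node_disk_read_time_seconds_total[30s]) / irate(node_disk_reads_completed_total[30s]) > 0.0075
        for: 5m
        annotations:
          description: Average disk read await is unexpectedly high; check iotune / disk settings.

A matching rule for writes would use node_disk_write_time_seconds_total and node_disk_writes_completed_total.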
