read timeout metrics are misleading #876

Closed
glommer opened this issue Apr 3, 2020 · 5 comments · Fixed by #877

Comments

@glommer
Contributor

glommer commented Apr 3, 2020

This is the metric we use for the "Read timeouts" graph:

$func(delta(scylla_storage_proxy_coordinator_read_timeouts{instance=~"[[node]]",cluster=~"$cluster|$^", dc=~"$dc", shard=~"[[shard]]"}[1m])) by ([[by]])

It is misleading because, while the graph says "Read", there are several types of reads, each with its own timeout metric. For instance, range query timeouts are accumulated in the metric scylla_storage_proxy_coordinator_range_timeouts, which this graph does not include.

With the introduction of LWT there are now also cas reads and cas writes.

Currently users are blind to that. I propose that we accumulate all read types into one graph in the Overview dashboard, and show per-type metrics in the detailed dashboard.
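
A minimal sketch of what the accumulated Overview query could look like, assuming both metrics carry the same label set so the two aggregated vectors match one-to-one. It only covers the two metrics named above; the CAS timeout metrics are left out because their names are not given here. If one of the metrics has no series for a given label combination, that combination drops out of the sum.

```
$func(delta(scylla_storage_proxy_coordinator_read_timeouts{instance=~"[[node]]",cluster=~"$cluster|$^", dc=~"$dc", shard=~"[[shard]]"}[1m])) by ([[by]])
+
$func(delta(scylla_storage_proxy_coordinator_range_timeouts{instance=~"[[node]]",cluster=~"$cluster|$^", dc=~"$dc", shard=~"[[shard]]"}[1m])) by ([[by]])
```

Each side is aggregated by the same `[[by]]` labels before the addition. Adding the raw `delta(...)` vectors directly would also work, but only if the underlying series labels are identical across the two metrics.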

Also, @amnonh, for newer versions please consider patching Scylla to use a label for each of those operation types instead of separate metric names. With labels, the dashboards would have worked out of the box, for free.
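
To illustrate the label idea only: the metric name `scylla_storage_proxy_coordinator_timeouts`, the label `op_type`, and its values below are hypothetical and do not exist in Scylla as far as this issue goes. With such a scheme, one selector could serve both dashboards.

```
# Hypothetical metric and label names, for illustration only.
# Overview: every read type accumulated into one graph.
$func(delta(scylla_storage_proxy_coordinator_timeouts{op_type=~"read|range_read|cas_read", instance=~"[[node]]",cluster=~"$cluster|$^", dc=~"$dc", shard=~"[[shard]]"}[1m])) by ([[by]])

# Detailed: the same selector, split into one series per operation type.
$func(delta(scylla_storage_proxy_coordinator_timeouts{op_type=~"read|range_read|cas_read", instance=~"[[node]]",cluster=~"$cluster|$^", dc=~"$dc", shard=~"[[shard]]"}[1m])) by ([[by]], op_type)
```

A new operation type would then show up in the detailed graph without a dashboard change, provided the `op_type` regex is loosened or dropped.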

@amnonh
Collaborator

amnonh commented Apr 5, 2020

@glommer, for future issues, can you please add the dashboard name and Scylla version (when applicable)?
Note that you can use the "report an issue" button, which adds that information for you.

@glommer
Contributor Author

glommer commented Apr 6, 2020

Timeouts appear in both the overview and detailed dashboards.
As far as I know all versions are affected.

@amnonh
Collaborator

amnonh commented Apr 6, 2020

Not all of them have LWT support, but I'll verify what is applicable.

@glommer
Contributor Author

glommer commented Apr 6, 2020

This has nothing to do with LWT.
The main issue I am complaining about is range reads vs. regular reads. This has been the case in all versions.

All I am saying is that LWT introduces new read and write types as well, so if we fix this, we should fix it in a way that doesn't stumble on the very next issue.

Please consider patching Scylla to use labels to make this easier in the future, but for now we need range and normal reads separated.

@amnonh
Collaborator

amnonh commented Apr 6, 2020

I've already opened a Scylla issue about it, but it's too late for OS 4.0. I'm fixing the rest for the next Monitoring release.
