Metrics that would be good to have #207

Closed
elsmorian opened this issue Sep 25, 2017 · 5 comments

@elsmorian
Contributor

After a chat with @adejanovski about whether Reaper exports metrics, he mentioned that metrics support is currently in master but that not many metrics are exported yet, and that I should open a GitHub issue if I had any requests. For us, useful metrics would be:

  • Number of segments pending repair (so we can chart progress over time, similar to Cassandra's own Repair PendingTasks metric)
  • Number of segments repaired per second (likely to be low for big repairs but still handy)
  • Number of repair events postponed per second due to high load or repairs already running.
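
To make the shape of these three numbers concrete, here is a rough sketch of how they could be derived from two samples of a repair run. Everything in it (the `Sample` fields, the function names) is hypothetical and only for illustration; it is not Reaper's API or its actual metric code.

```python
from dataclasses import dataclass


@dataclass
class Sample:
    """One observation of a repair run. All fields are hypothetical."""
    timestamp: float       # seconds since some reference point
    segments_total: int    # segments in the repair run
    segments_done: int     # segments repaired so far (cumulative)
    postponed_total: int   # postponed segment attempts so far (cumulative)


def pending(sample: Sample) -> int:
    """Segments still waiting to be repaired at the time of the sample."""
    return sample.segments_total - sample.segments_done


def rate_per_second(earlier: Sample, later: Sample, field: str) -> float:
    """Average per-second increase of a cumulative counter between two samples."""
    elapsed = later.timestamp - earlier.timestamp
    if elapsed <= 0:
        return 0.0
    return (getattr(later, field) - getattr(earlier, field)) / elapsed


# Two made-up samples taken 60 seconds apart.
first = Sample(timestamp=0.0, segments_total=1000, segments_done=100, postponed_total=3)
second = Sample(timestamp=60.0, segments_total=1000, segments_done=130, postponed_total=9)

print(pending(second))
print(rate_per_second(first, second, "segments_done"))
print(rate_per_second(first, second, "postponed_total"))
```

In practice a metrics backend usually derives the per-second rates itself from a cumulative counter, so exporting the raw counts may be enough.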
@elsmorian
Contributor Author

Oh, and the current number of nodes in each data centre that are up or down, that would be super helpful!

@rzvoncek
Contributor

Hi!

I'll look into this one. And while I'm at it, I'll add two more metrics:

  • repair progress (per cluster): once plotted on a dashboard, a flat progress line will clearly show stalled repairs.
    • This is somewhat similar to the number of pending segments above; I'll see whether to include one or both.
  • time of last successful repair (per cluster): will make it easy to spot missing repairs on a cluster
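
As a sketch of what these two metrics could look like and how a dashboard might use them (the function names and the stall rule are assumptions for illustration, not what Reaper implements):

```python
import time
from typing import Optional, Sequence


def repair_progress(segments_done: int, segments_total: int) -> float:
    """Fraction of segments repaired, between 0.0 and 1.0."""
    if segments_total <= 0:
        return 0.0
    return segments_done / segments_total


def seconds_since_last_repair(last_success_epoch: float,
                              now: Optional[float] = None) -> float:
    """Age of the last successful repair, in seconds."""
    current = time.time() if now is None else now
    return max(0.0, current - last_success_epoch)


def is_stalled(progress_samples: Sequence[float]) -> bool:
    """Hypothetical rule: an unfinished repair whose progress did not move
    between the first and last sample in the window counts as stalled."""
    if len(progress_samples) < 2:
        return False
    unfinished = progress_samples[-1] < 1.0
    flat = progress_samples[-1] <= progress_samples[0]
    return unfinished and flat


print(repair_progress(250, 1000))
print(is_stalled([0.4, 0.4, 0.4]))
print(seconds_since_last_repair(100.0, now=160.0))
```

An alert on the age of the last successful repair would typically compare it against the cluster's repair schedule.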

@rzvoncek rzvoncek self-assigned this Oct 25, 2017
@elsmorian
Contributor Author

@rzvoncek 👍 totally agree on that, thanks for having a look into this!

@rzvoncek
Contributor

Hi.

I ended up not adding the postponed-repairs metric, because something similar already exists:

"io.cassandrareaper.service.SegmentRunner.postpone.null.testcluster.keyspace1" : {
  "count" : 13
},

The null should be the coordinator host, but for some reason it isn't populated for me :-/.
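
For anyone consuming this metric, a small sketch of splitting the name into its parts and treating the literal `null` as a missing host. The layout (host, then cluster, then keyspace after the `postpone` prefix) is inferred from the single example above, and the sketch assumes none of the three parts contains a dot; the function name is made up.

```python
# Prefix taken from the example metric name in this comment.
PREFIX = "io.cassandrareaper.service.SegmentRunner.postpone."


def parse_postpone_metric(name: str) -> dict:
    """Split a postpone metric name into host, cluster and keyspace.

    Assumes exactly three dot-free parts after the prefix.
    """
    if not name.startswith(PREFIX):
        raise ValueError(f"not a postpone metric: {name}")
    parts = name[len(PREFIX):].split(".")
    if len(parts) != 3:
        raise ValueError(f"unexpected number of parts in: {name}")
    host, cluster, keyspace = parts
    return {
        # The literal string "null" means the host was not populated.
        "host": None if host == "null" else host,
        "cluster": cluster,
        "keyspace": keyspace,
    }


metric = "io.cassandrareaper.service.SegmentRunner.postpone.null.testcluster.keyspace1"
print(parse_postpone_metric(metric))
```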

rzvoncek added a commit that referenced this issue Oct 31, 2017
Add metrics for repair progress + time since last repair. Fixes #207.
@elsmorian
Contributor Author

Thanks for adding these in :)
