Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Throttler: stats in /debug/vars #10443

Merged
merged 19 commits into from
Jun 22, 2022

Conversation

shlomi-noach
Copy link
Contributor

@shlomi-noach shlomi-noach commented Jun 6, 2022

Description

This PR exposes throttler metrics on /debug/vars. For example:

$ curl -s http://127.0.0.1:15100/debug/vars | jq . | grep Throttler | grep -v Pool
  "ThrottlerAggregatedMysqlSelf": 0.191718,
  "ThrottlerAggregatedMysqlShard": 0.960054,
  "ThrottlerCheckAnyError": 27,
  "ThrottlerCheckAnyMysqlSelfError": 13,
  "ThrottlerCheckAnyMysqlSelfTotal": 38,
  "ThrottlerCheckAnyMysqlShardError": 14,
  "ThrottlerCheckAnyMysqlShardTotal": 42,
  "ThrottlerCheckAnyTotal": 80,
  "ThrottlerCheckMysqlSelfSecondsSinceHealthy": 0,
  "ThrottlerCheckMysqlShardSecondsSinceHealthy": 0,
  "ThrottlerProbesLatency": 355523,
  "ThrottlerProbesTotal": 74,

The above shows us the aggregated metrics for existing metrics (first two lines), then check results for each app:

  • Total means how many checks were made by the app
  • Error are how many times the throttler returned with non-success, out of total
  • Any is a combination of all apps
  • MysqlShard are checks for shard lag (the standard /throttler/check API call)
  • MysqlSelf are checks for the state of the specific tablet's MySQL (/throttler/check-self API call)

Implementation notes

The metrics exported here require a float64 gauge. see for example throttler.aggregated.mysql.shard, which tells us the replicatoin lag on a shard. It is imperative that we have subsecond resolution, and a fraction number makes sense (it would be possible to achieve the same with uint64 as nanoseconds, but we inherit the fraction behavior from freno, and it's been in vitess for multiple versions now).

To that effect, I created Gauge64, which in turn means changes in prometheus, opentsdb, statsd, exporter, ....
Notable, exporter assumes all counters/gauges are int64 based; notice I haven't found a good solution and implemented like so:

func (e *Exporter) NewGaugeFloat64(name string, help string) *stats.GaugeFloat64 {
	return nil
}

Please review carefully those parts and let me know if there is a risk in there.

Related Issue(s)

Checklist

  • "Backport me!" label has been added if this change should be backported
  • Tests were added or are not required
  • Documentation was added or is not required

Initial PR had these variables under a different format:

$ curl -s http://127.0.0.1:15100/debug/vars | jq . | grep Throttler | grep -v Pool
  "ThrottlerAggregatedMysqlSelf": 0.191718,
  "ThrottlerAggregatedMysqlShard": 0.960054,
  "ThrottlerCheckAnyError": 27,
  "ThrottlerCheckAnyMysqlSelfError": 13,
  "ThrottlerCheckAnyMysqlSelfTotal": 38,
  "ThrottlerCheckAnyMysqlShardError": 14,
  "ThrottlerCheckAnyMysqlShardTotal": 42,
  "ThrottlerCheckAnyTotal": 80,
  "ThrottlerCheckMysqlSelfSecondsSinceHealthy": 0,
  "ThrottlerCheckMysqlShardSecondsSinceHealthy": 0,
  "ThrottlerProbesLatency": 355523,
  "ThrottlerProbesTotal": 74,

This comment is updated to reflect the new format.

Signed-off-by: Shlomi Noach <2607934+shlomi-noach@users.noreply.github.com>
Signed-off-by: Shlomi Noach <2607934+shlomi-noach@users.noreply.github.com>
Signed-off-by: Shlomi Noach <2607934+shlomi-noach@users.noreply.github.com>
Signed-off-by: Shlomi Noach <2607934+shlomi-noach@users.noreply.github.com>
Signed-off-by: Shlomi Noach <2607934+shlomi-noach@users.noreply.github.com>
Signed-off-by: Shlomi Noach <2607934+shlomi-noach@users.noreply.github.com>
Signed-off-by: Shlomi Noach <2607934+shlomi-noach@users.noreply.github.com>
Signed-off-by: Shlomi Noach <2607934+shlomi-noach@users.noreply.github.com>
Signed-off-by: Shlomi Noach <2607934+shlomi-noach@users.noreply.github.com>
@shlomi-noach shlomi-noach added Type: Enhancement Logical improvement (somewhere between a bug and feature) Component: Cluster management release notes (needs details) This PR needs to be listed in the release notes in a dedicated section (deprecation notice, etc...) Skip Upgrade Downgrade labels Jun 6, 2022
Signed-off-by: Shlomi Noach <2607934+shlomi-noach@users.noreply.github.com>
@github-actions
Copy link
Contributor

github-actions bot commented Jun 6, 2022

Review Checklist

Hello reviewers! 👋 Please follow this checklist when reviewing this Pull Request.

General

  • Ensure that the Pull Request has the correct release notes label. release notes none should only be used for PRs that are so trivial that they need not be included.
  • If a new flag is being introduced, review whether it is really needed. The flag names should be clear and intuitive (as far as possible), and the flag's help should be descriptive.

Bug fixes

  • There should be at least one unit or end-to-end test.
  • The Pull Request description should either include a link to an issue that describes the bug OR an actual description of the bug and how to reproduce, along with a description of the fix.

Non-trivial changes

  • There should be some code comments as to why things are implemented the way they are.

New/Existing features

  • Should be documented, either by modifying the existing documentation or creating new documentation.
  • New features should have a link to a feature request issue or an RFC that documents the use cases, corner cases and test cases.

Backward compatibility

  • Protobuf changes should be wire-compatible.
  • Changes to _vt tables and RPCs need to be backward compatible.
  • vtctl command output order should be stable and awk-able.

Signed-off-by: Shlomi Noach <2607934+shlomi-noach@users.noreply.github.com>
Signed-off-by: Shlomi Noach <2607934+shlomi-noach@users.noreply.github.com>
@shlomi-noach
Copy link
Contributor Author

(onlineddl_vrepl_stress_suite) mysql80 CI failure is fine. Fixed earlier in #10441 , not merging main because of high CI contention.

@deepthi
Copy link
Member

deepthi commented Jun 6, 2022

Porting over my question from the previous PR:

  1. Why are the metric names formatted the way they are? Vitess convention would be ThrottlerAggregatedMysqlSelf versus throttler.aggregated.mysql.self
  2. What do they look like if you access the same metrics from the /metrics endpoint?

@shlomi-noach
Copy link
Contributor Author

shlomi-noach commented Jun 6, 2022

Why are the metric names formatted the way they are?

To begin with, that's how they were named, imported from freno, because the dot . is a significant delimiter in metrics systems, such as Graphite and DataDog, which makes it then easy to look for throttler.aggregated.%.self for example.

But, also, the metrics names are parameterized; mysql, self, shard are parameterized parts. Let's say we can call them MySQL, Self, Shard because we control them -- the app names are also parameterized. gh-ost, vreplication, ...

The existing vitess names like RowStreamerMaxInnoDBTrxHistLen are constants.

I do see examples of parameterized connection pool params.

But I don't have clarity on how we might deal with parameterized app names. I also question the value of the current vitess naming scheme? What is the advantage of single word CamelCase?

@deepthi
Copy link
Member

deepthi commented Jun 6, 2022

I also question the value of the current vitess naming scheme? What is the advantage of single word CamelCase?

That is a valid question :)
When you get metrics from /metrics, CamelCase gets converted to snake_case. E.g. AppConnPoolActive becomes vttablet_app_conn_pool_active (also prefixed with the binary from which the metrics are being emitted).
Hence my second question:

What do they look like if you access the same metrics from the /metrics endpoint?

What I haven't looked at is whether this conversion is something we have implemented in our own Prometheus backend code.

@deepthi
Copy link
Member

deepthi commented Jun 6, 2022

Looks like we do convert . to _ for Prometheus metrics, but it will be nice to verify on a running instance with these new metrics that it works as expected.
https://github.com/vitessio/vitess/blob/main/go/stats/snake_case_converter.go#L58
The only rationale for CamelCase at this point is consistency. /debug/vars is not meant to be piped to production metric systems, for that we have /metrics. And an integration with something like DataDog would require someone to create a new stats backend for it similar to what we have for prometheus, and that backend would do the conversion to the desired naming convention.

@shlomi-noach
Copy link
Contributor Author

This is what /metrics looks like:

$ curl -s http://127.0.0.1:15100/metrics | grep throttler
# HELP vttablet_throttler_aggregated_mysql_self aggregated value for mysql.self
# TYPE vttablet_throttler_aggregated_mysql_self gauge
vttablet_throttler_aggregated_mysql_self 60620.693474
# HELP vttablet_throttler_aggregated_mysql_shard aggregated value for mysql.shard
# TYPE vttablet_throttler_aggregated_mysql_shard gauge
vttablet_throttler_aggregated_mysql_shard 60620.497982
# HELP vttablet_throttler_check_any_error total number of failed checks
# TYPE vttablet_throttler_check_any_error counter
vttablet_throttler_check_any_error 484982
# HELP vttablet_throttler_check_any_mysql_self_error 
# TYPE vttablet_throttler_check_any_mysql_self_error counter
vttablet_throttler_check_any_mysql_self_error 242490
# HELP vttablet_throttler_check_any_mysql_self_total 
# TYPE vttablet_throttler_check_any_mysql_self_total counter
vttablet_throttler_check_any_mysql_self_total 242518
# HELP vttablet_throttler_check_any_mysql_shard_error 
# TYPE vttablet_throttler_check_any_mysql_shard_error counter
vttablet_throttler_check_any_mysql_shard_error 242492
# HELP vttablet_throttler_check_any_mysql_shard_total 
# TYPE vttablet_throttler_check_any_mysql_shard_total counter
vttablet_throttler_check_any_mysql_shard_total 242518
# HELP vttablet_throttler_check_any_total total number of checks
# TYPE vttablet_throttler_check_any_total counter
vttablet_throttler_check_any_total 485036
# HELP vttablet_throttler_check_mysql_self_seconds_since_healthy seconds since last healthy cehck for mysql.self
# TYPE vttablet_throttler_check_mysql_self_seconds_since_healthy gauge
vttablet_throttler_check_mysql_self_seconds_since_healthy 60619
# HELP vttablet_throttler_check_mysql_shard_seconds_since_healthy seconds since last healthy cehck for mysql.shard
# TYPE vttablet_throttler_check_mysql_shard_seconds_since_healthy gauge
vttablet_throttler_check_mysql_shard_seconds_since_healthy 60619
# HELP vttablet_throttler_check_vitess_error total number of failed checks for vitess
# TYPE vttablet_throttler_check_vitess_error counter
vttablet_throttler_check_vitess_error 484982
# HELP vttablet_throttler_check_vitess_mysql_self_error 
# TYPE vttablet_throttler_check_vitess_mysql_self_error counter
...

@shlomi-noach
Copy link
Contributor Author

/debug/vars is not meant to be piped to production metric systems, for that we have /metrics

TIL and I really wasn't aware of /metrics until this issue.

@shlomi-noach
Copy link
Contributor Author

Let me look into camel casing metric names.

@shlomi-noach
Copy link
Contributor Author

Converting to Draft while looking into a few things

@shlomi-noach
Copy link
Contributor Author

@deepthi how about the following:

$ curl -s http://127.0.0.1:15100/debug/vars | jq . | grep Throttler | grep -v Pool
  "ThrottlerAggregatedMysqlSelf": 0.191718,
  "ThrottlerAggregatedMysqlShard": 0.960054,
  "ThrottlerCheckAnyError": 27,
  "ThrottlerCheckAnyMysqlSelfError": 13,
  "ThrottlerCheckAnyMysqlSelfTotal": 38,
  "ThrottlerCheckAnyMysqlShardError": 14,
  "ThrottlerCheckAnyMysqlShardTotal": 42,
  "ThrottlerCheckAnyTotal": 80,
  "ThrottlerCheckMysqlSelfSecondsSinceHealthy": 0,
  "ThrottlerCheckMysqlShardSecondsSinceHealthy": 0,
  "ThrottlerProbesLatency": 355523,
  "ThrottlerProbesTotal": 74,

$ curl -s http://127.0.0.1:15100/metrics | grep -i throttler | grep -v pool
# HELP vttablet_throttler_aggregated_mysql_self aggregated value for mysql.self
# TYPE vttablet_throttler_aggregated_mysql_self gauge
vttablet_throttler_aggregated_mysql_self 0.827354
# HELP vttablet_throttler_aggregated_mysql_shard aggregated value for mysql.shard
# TYPE vttablet_throttler_aggregated_mysql_shard gauge
vttablet_throttler_aggregated_mysql_shard 0.591876
# HELP vttablet_throttler_check_any_error total number of failed checks
# TYPE vttablet_throttler_check_any_error counter
vttablet_throttler_check_any_error 27
# HELP vttablet_throttler_check_any_mysql_self_error 
# TYPE vttablet_throttler_check_any_mysql_self_error counter
vttablet_throttler_check_any_mysql_self_error 13
# HELP vttablet_throttler_check_any_mysql_self_total 
# TYPE vttablet_throttler_check_any_mysql_self_total counter
vttablet_throttler_check_any_mysql_self_total 57
# HELP vttablet_throttler_check_any_mysql_shard_error 
# TYPE vttablet_throttler_check_any_mysql_shard_error counter
vttablet_throttler_check_any_mysql_shard_error 14
# HELP vttablet_throttler_check_any_mysql_shard_total 
# TYPE vttablet_throttler_check_any_mysql_shard_total counter
vttablet_throttler_check_any_mysql_shard_total 64
# HELP vttablet_throttler_check_any_total total number of checks
# TYPE vttablet_throttler_check_any_total counter
vttablet_throttler_check_any_total 121
# HELP vttablet_throttler_check_mysql_self_seconds_since_healthy seconds since last healthy cehck for mysql.self
# TYPE vttablet_throttler_check_mysql_self_seconds_since_healthy gauge
vttablet_throttler_check_mysql_self_seconds_since_healthy 0
# HELP vttablet_throttler_check_mysql_shard_seconds_since_healthy seconds since last healthy cehck for mysql.shard
# TYPE vttablet_throttler_check_mysql_shard_seconds_since_healthy gauge
vttablet_throttler_check_mysql_shard_seconds_since_healthy 0
# HELP vttablet_throttler_probes_latency probes latency
# TYPE vttablet_throttler_probes_latency gauge
vttablet_throttler_probes_latency 347382
# HELP vttablet_throttler_probes_total total probes
# TYPE vttablet_throttler_probes_total counter
vttablet_throttler_probes_total 114

In the above:

  • everything is camel cased
  • I chose to (meanwhile?) not provide stats for specific apps/workflows. It can get ugly and spammy with UUIDs, an dcan cause bloating of /debug/vars and bloating of memory.

I confess to dislike the CamelCase approach, because in a scenario where I need more than one word to describe something, such as in the above SecondsSinceHealthy, I want that description to be atomic, rather then be split into three parts.

Signed-off-by: Shlomi Noach <2607934+shlomi-noach@users.noreply.github.com>
Signed-off-by: Shlomi Noach <2607934+shlomi-noach@users.noreply.github.com>
…trics

Signed-off-by: Shlomi Noach <2607934+shlomi-noach@users.noreply.github.com>
@shlomi-noach shlomi-noach marked this pull request as ready for review June 7, 2022 11:26
@deepthi
Copy link
Member

deepthi commented Jun 10, 2022

In terms of naming, this now looks good.
I previously forgot to address another part of your comments, apologies for that, and I'll do it now.

But, also, the metrics names are parameterized; mysql, self, shard are parameterized parts. Let's say we can call them MySQL, Self, Shard because we control them -- the app names are also parameterized. gh-ost, vreplication, ...

The existing vitess names like RowStreamerMaxInnoDBTrxHistLen are constants.

I do see examples of parameterized connection pool params.

But I don't have clarity on how we might deal with parameterized app names.

The stats package allows you to attach labels to the same metric. For example,

# TYPE vttablet_query_row_counts counter
vttablet_query_row_counts{plan="Insert",table="corder"} 5
vttablet_query_row_counts{plan="Insert",table="customer"} 5
vttablet_query_row_counts{plan="Insert",table="product"} 2
vttablet_query_row_counts{plan="Select",table="corder"} 0
vttablet_query_row_counts{plan="Select",table="customer"} 0
vttablet_query_row_counts{plan="Select",table="dual"} 0
vttablet_query_row_counts{plan="Select",table="product"} 0

labels don't have to be any particular case as you can see from this example.
If self/shard are values of a particular property, then the property can be defined as a label, and these become the values that we emit for that label. Similarly app can be a label with certain values that we produce.
Popular monitoring tools all support querying / filtering by labels, so this approach works well for not needing to add new metrics every time we decide to support a new value for a certain property of the metric.

Does this help?

@shlomi-noach
Copy link
Contributor Author

Does this help?

Yes, thank you!

Signed-off-by: Shlomi Noach <2607934+shlomi-noach@users.noreply.github.com>
@shlomi-noach shlomi-noach added release notes and removed release notes (needs details) This PR needs to be listed in the release notes in a dedicated section (deprecation notice, etc...) labels Jun 19, 2022
@@ -0,0 +1,95 @@
/*
Copyright 2019 The Vitess Authors.
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

🙄

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

fixed

t.Errorf("want %#v, got %#v", v, gotv)
}
v.Set(3.14)
if v.Get() != 3.14 {
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

nit: I personally find stretchr assertions like assert.Equal(t, 3.14, v.Get()) much easier to read

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

You're absolutely right; I copied+pasted existing tests and kept the original code conventions (I assume this was written way before testify was in use). I'll update.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

fixed

for _, tc := range tt {
t.Run(tc.word, func(t *testing.T) {
camel := SingleWordCamel(tc.word)
assert.Equal(t, tc.expect, camel)
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

🤩

@@ -307,6 +307,10 @@ func (e *Exporter) NewGauge(name string, help string) *stats.Gauge {
return lvar
}

func (e *Exporter) NewGaugeFloat64(name string, help string) *stats.GaugeFloat64 {
return nil
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I don't follow why this is returning nil. Maybe add a comment?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

See original comment, where this is explained. I'll also add as code comment.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

added

Signed-off-by: Shlomi Noach <2607934+shlomi-noach@users.noreply.github.com>
Signed-off-by: Shlomi Noach <2607934+shlomi-noach@users.noreply.github.com>
Signed-off-by: Shlomi Noach <2607934+shlomi-noach@users.noreply.github.com>
@shlomi-noach shlomi-noach merged commit bddc71e into vitessio:main Jun 22, 2022
@shlomi-noach shlomi-noach deleted the throttler-stats-in-debug-vars branch June 22, 2022 06:38
@shlomi-noach shlomi-noach mentioned this pull request Jun 23, 2022
3 tasks
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
Component: Cluster management Type: Enhancement Logical improvement (somewhere between a bug and feature)
Projects
None yet
Development

Successfully merging this pull request may close these issues.

None yet

3 participants