Throttler: stats in /debug/vars #10443

shlomi-noach · 2022-06-06T12:22:34Z

Description

This PR exposes throttler metrics on /debug/vars. For example:

$ curl -s http://127.0.0.1:15100/debug/vars | jq . | grep Throttler | grep -v Pool
  "ThrottlerAggregatedMysqlSelf": 0.191718,
  "ThrottlerAggregatedMysqlShard": 0.960054,
  "ThrottlerCheckAnyError": 27,
  "ThrottlerCheckAnyMysqlSelfError": 13,
  "ThrottlerCheckAnyMysqlSelfTotal": 38,
  "ThrottlerCheckAnyMysqlShardError": 14,
  "ThrottlerCheckAnyMysqlShardTotal": 42,
  "ThrottlerCheckAnyTotal": 80,
  "ThrottlerCheckMysqlSelfSecondsSinceHealthy": 0,
  "ThrottlerCheckMysqlShardSecondsSinceHealthy": 0,
  "ThrottlerProbesLatency": 355523,
  "ThrottlerProbesTotal": 74,

The first two lines show aggregated values for the existing throttler metrics. The remaining lines show check results for each app:

Total is how many checks were made by the app
Error is how many of those checks the throttler answered with non-success
Any is a combination of all apps
MysqlShard are checks for shard lag (the standard /throttler/check API call)
MysqlSelf are checks for the state of the specific tablet's MySQL (/throttler/check-self API call)
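
As a rough illustration of how the Total and Error counters relate, here is a minimal sketch that parses a /debug/vars style JSON document and computes the error ratio for a metric prefix. The helper name checkErrorRatio is hypothetical and not part of this PR; the sample values are copied from the output above.

```go
package main

import (
	"encoding/json"
	"fmt"
)

// checkErrorRatio is a hypothetical helper (not part of Vitess). It returns
// Error/Total for a metric prefix such as "ThrottlerCheckAny". The second
// return value is false if either key is missing, not numeric, or Total is 0.
func checkErrorRatio(vars map[string]interface{}, prefix string) (float64, bool) {
	// encoding/json decodes every JSON number into float64 when the
	// target type is interface{}.
	total, okTotal := vars[prefix+"Total"].(float64)
	errs, okErrs := vars[prefix+"Error"].(float64)
	if !okTotal || !okErrs || total == 0 {
		return 0, false
	}
	return errs / total, true
}

func main() {
	// A subset of the sample /debug/vars output shown above.
	sample := []byte(`{
		"ThrottlerCheckAnyError": 27,
		"ThrottlerCheckAnyTotal": 80,
		"ThrottlerCheckAnyMysqlSelfError": 13,
		"ThrottlerCheckAnyMysqlSelfTotal": 38,
		"ThrottlerCheckAnyMysqlShardError": 14,
		"ThrottlerCheckAnyMysqlShardTotal": 42
	}`)

	var vars map[string]interface{}
	if err := json.Unmarshal(sample, &vars); err != nil {
		panic(err)
	}

	prefixes := []string{
		"ThrottlerCheckAny",
		"ThrottlerCheckAnyMysqlSelf",
		"ThrottlerCheckAnyMysqlShard",
	}
	for _, prefix := range prefixes {
		if ratio, ok := checkErrorRatio(vars, prefix); ok {
			fmt.Printf("%s error ratio: %.3f\n", prefix, ratio)
		}
	}
}
```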

Implementation notes

The metrics exported here require a float64 gauge. See for example throttler.aggregated.mysql.shard, which tells us the replication lag on a shard. It is imperative that we have subsecond resolution, and a fractional number makes sense (it would be possible to achieve the same with uint64 as nanoseconds, but we inherit the fractional behavior from freno, and it's been in vitess for multiple versions now).

To that effect, I created GaugeFloat64, which in turn means changes in prometheus, opentsdb, statsd, exporter, ....
Notably, exporter assumes all counters/gauges are int64 based; I haven't found a good solution and implemented it like so:

func (e *Exporter) NewGaugeFloat64(name string, help string) *stats.GaugeFloat64 {
	return nil
}

Please review carefully those parts and let me know if there is a risk in there.
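
For readers unfamiliar with the idea, a float64 gauge that is safe for concurrent use could be sketched as below. This is an assumption-laden illustration, not the GaugeFloat64 added in this PR: the type name floatGauge and its storage strategy (keeping the IEEE 754 bits in a uint64 and using atomic operations) are hypothetical.

```go
package main

import (
	"fmt"
	"math"
	"sync/atomic"
)

// floatGauge is a hypothetical float64 gauge, shown for illustration only.
// It stores the IEEE 754 bit pattern of the value in a uint64 so that
// reads and writes can use atomic operations without a mutex.
type floatGauge struct {
	bits uint64
}

// Set overwrites the gauge with v.
func (g *floatGauge) Set(v float64) {
	atomic.StoreUint64(&g.bits, math.Float64bits(v))
}

// Get returns the current value of the gauge.
func (g *floatGauge) Get() float64 {
	return math.Float64frombits(atomic.LoadUint64(&g.bits))
}

func main() {
	var shardLag floatGauge
	// Subsecond lag, expressed as fractional seconds.
	shardLag.Set(0.960054)
	fmt.Println(shardLag.Get())
}
```

Storing nanoseconds in an integer gauge would give the same resolution; the float form simply matches the fractional-seconds values the throttler already reports.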

Related Issue(s)

Checklist

"Backport me!" label has been added if this change should be backported
Tests were added or are not required
Documentation was added or is not required

The initial PR had these variables under a different format. This comment is updated to reflect the new format.

github-actions · 2022-06-06T12:33:36Z

shlomi-noach · 2022-06-06T14:16:02Z

(onlineddl_vrepl_stress_suite) mysql80 CI failure is fine. It was fixed earlier in #10441; I'm not merging main because of high CI contention.

deepthi · 2022-06-06T18:13:49Z

Porting over my question from the previous PR:

Why are the metric names formatted the way they are? Vitess convention would be ThrottlerAggregatedMysqlSelf versus throttler.aggregated.mysql.self
What do they look like if you access the same metrics from the /metrics endpoint?

shlomi-noach · 2022-06-06T18:38:39Z

Why are the metric names formatted the way they are?

To begin with, that's how they were named when imported from freno. The dot . is a significant delimiter in metrics systems such as Graphite and DataDog, which makes it easy to look for throttler.aggregated.%.self, for example.

But, also, the metrics names are parameterized; mysql, self, shard are parameterized parts. Let's say we can call them MySQL, Self, Shard because we control them -- the app names are also parameterized. gh-ost, vreplication, ...

The existing vitess names like RowStreamerMaxInnoDBTrxHistLen are constants.

I do see examples of parameterized connection pool params.

But I don't have clarity on how we might deal with parameterized app names. I also question the value of the current vitess naming scheme? What is the advantage of single word CamelCase?

deepthi · 2022-06-06T19:57:09Z

I also question the value of the current vitess naming scheme? What is the advantage of single word CamelCase?

That is a valid question :)
When you get metrics from /metrics, CamelCase gets converted to snake_case. E.g. AppConnPoolActive becomes vttablet_app_conn_pool_active (also prefixed with the binary from which the metrics are being emitted).
Hence my second question:

What do they look like if you access the same metrics from the /metrics endpoint?

What I haven't looked at is whether this conversion is something we have implemented in our own Prometheus backend code.

deepthi · 2022-06-06T21:32:21Z

Looks like we do convert . to _ for Prometheus metrics, but it would be nice to verify on a running instance with these new metrics that it works as expected.
https://github.com/vitessio/vitess/blob/main/go/stats/snake_case_converter.go#L58
The only rationale for CamelCase at this point is consistency. /debug/vars is not meant to be piped to production metric systems, for that we have /metrics. And an integration with something like DataDog would require someone to create a new stats backend for it similar to what we have for prometheus, and that backend would do the conversion to the desired naming convention.
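
To make the naming conversion discussed here concrete, the following is a simplified sketch of a CamelCase to snake_case conversion that also maps dots to underscores. It is not the code in snake_case_converter.go; the function name toSnakeCase and its exact rules are assumptions for illustration, and the real converter may treat edge cases differently.

```go
package main

import (
	"fmt"
	"strings"
	"unicode"
)

// toSnakeCase is a hypothetical, simplified converter. It lowercases the
// name, replaces dots with underscores, and inserts an underscore at word
// boundaries: before an uppercase rune that follows a lowercase rune or a
// digit, and before the last uppercase rune of an acronym that is followed
// by a lowercase rune (so "InnoDBTrx" becomes "inno_db_trx").
func toSnakeCase(name string) string {
	runes := []rune(name)
	var b strings.Builder
	for i, r := range runes {
		if r == '.' {
			b.WriteRune('_')
			continue
		}
		if unicode.IsUpper(r) && i > 0 {
			prev := runes[i-1]
			nextIsLower := i+1 < len(runes) && unicode.IsLower(runes[i+1])
			if unicode.IsLower(prev) || unicode.IsDigit(prev) ||
				(unicode.IsUpper(prev) && nextIsLower) {
				b.WriteRune('_')
			}
		}
		b.WriteRune(unicode.ToLower(r))
	}
	return b.String()
}

func main() {
	for _, name := range []string{
		"AppConnPoolActive",
		"ThrottlerAggregatedMysqlSelf",
		"throttler.aggregated.mysql.self",
	} {
		// The "vttablet_" prefix mirrors the binary-name prefix
		// described in the comment above.
		fmt.Println("vttablet_" + toSnakeCase(name))
	}
}
```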

shlomi-noach · 2022-06-07T05:27:35Z

This is what /metrics looks like:

$ curl -s http://127.0.0.1:15100/metrics | grep throttler
# HELP vttablet_throttler_aggregated_mysql_self aggregated value for mysql.self
# TYPE vttablet_throttler_aggregated_mysql_self gauge
vttablet_throttler_aggregated_mysql_self 60620.693474
# HELP vttablet_throttler_aggregated_mysql_shard aggregated value for mysql.shard
# TYPE vttablet_throttler_aggregated_mysql_shard gauge
vttablet_throttler_aggregated_mysql_shard 60620.497982
# HELP vttablet_throttler_check_any_error total number of failed checks
# TYPE vttablet_throttler_check_any_error counter
vttablet_throttler_check_any_error 484982
# HELP vttablet_throttler_check_any_mysql_self_error 
# TYPE vttablet_throttler_check_any_mysql_self_error counter
vttablet_throttler_check_any_mysql_self_error 242490
# HELP vttablet_throttler_check_any_mysql_self_total 
# TYPE vttablet_throttler_check_any_mysql_self_total counter
vttablet_throttler_check_any_mysql_self_total 242518
# HELP vttablet_throttler_check_any_mysql_shard_error 
# TYPE vttablet_throttler_check_any_mysql_shard_error counter
vttablet_throttler_check_any_mysql_shard_error 242492
# HELP vttablet_throttler_check_any_mysql_shard_total 
# TYPE vttablet_throttler_check_any_mysql_shard_total counter
vttablet_throttler_check_any_mysql_shard_total 242518
# HELP vttablet_throttler_check_any_total total number of checks
# TYPE vttablet_throttler_check_any_total counter
vttablet_throttler_check_any_total 485036
# HELP vttablet_throttler_check_mysql_self_seconds_since_healthy seconds since last healthy cehck for mysql.self
# TYPE vttablet_throttler_check_mysql_self_seconds_since_healthy gauge
vttablet_throttler_check_mysql_self_seconds_since_healthy 60619
# HELP vttablet_throttler_check_mysql_shard_seconds_since_healthy seconds since last healthy cehck for mysql.shard
# TYPE vttablet_throttler_check_mysql_shard_seconds_since_healthy gauge
vttablet_throttler_check_mysql_shard_seconds_since_healthy 60619
# HELP vttablet_throttler_check_vitess_error total number of failed checks for vitess
# TYPE vttablet_throttler_check_vitess_error counter
vttablet_throttler_check_vitess_error 484982
# HELP vttablet_throttler_check_vitess_mysql_self_error 
# TYPE vttablet_throttler_check_vitess_mysql_self_error counter
...

shlomi-noach · 2022-06-07T05:29:36Z

/debug/vars is not meant to be piped to production metric systems, for that we have /metrics

TIL and I really wasn't aware of /metrics until this issue.

shlomi-noach · 2022-06-07T05:30:12Z

Let me look into camel casing metric names.

shlomi-noach · 2022-06-07T06:36:54Z

Converting to Draft while looking into a few things

shlomi-noach · 2022-06-07T10:00:36Z

@deepthi how about the following:

$ curl -s http://127.0.0.1:15100/debug/vars | jq . | grep Throttler | grep -v Pool
  "ThrottlerAggregatedMysqlSelf": 0.191718,
  "ThrottlerAggregatedMysqlShard": 0.960054,
  "ThrottlerCheckAnyError": 27,
  "ThrottlerCheckAnyMysqlSelfError": 13,
  "ThrottlerCheckAnyMysqlSelfTotal": 38,
  "ThrottlerCheckAnyMysqlShardError": 14,
  "ThrottlerCheckAnyMysqlShardTotal": 42,
  "ThrottlerCheckAnyTotal": 80,
  "ThrottlerCheckMysqlSelfSecondsSinceHealthy": 0,
  "ThrottlerCheckMysqlShardSecondsSinceHealthy": 0,
  "ThrottlerProbesLatency": 355523,
  "ThrottlerProbesTotal": 74,

$ curl -s http://127.0.0.1:15100/metrics | grep -i throttler | grep -v pool
# HELP vttablet_throttler_aggregated_mysql_self aggregated value for mysql.self
# TYPE vttablet_throttler_aggregated_mysql_self gauge
vttablet_throttler_aggregated_mysql_self 0.827354
# HELP vttablet_throttler_aggregated_mysql_shard aggregated value for mysql.shard
# TYPE vttablet_throttler_aggregated_mysql_shard gauge
vttablet_throttler_aggregated_mysql_shard 0.591876
# HELP vttablet_throttler_check_any_error total number of failed checks
# TYPE vttablet_throttler_check_any_error counter
vttablet_throttler_check_any_error 27
# HELP vttablet_throttler_check_any_mysql_self_error 
# TYPE vttablet_throttler_check_any_mysql_self_error counter
vttablet_throttler_check_any_mysql_self_error 13
# HELP vttablet_throttler_check_any_mysql_self_total 
# TYPE vttablet_throttler_check_any_mysql_self_total counter
vttablet_throttler_check_any_mysql_self_total 57
# HELP vttablet_throttler_check_any_mysql_shard_error 
# TYPE vttablet_throttler_check_any_mysql_shard_error counter
vttablet_throttler_check_any_mysql_shard_error 14
# HELP vttablet_throttler_check_any_mysql_shard_total 
# TYPE vttablet_throttler_check_any_mysql_shard_total counter
vttablet_throttler_check_any_mysql_shard_total 64
# HELP vttablet_throttler_check_any_total total number of checks
# TYPE vttablet_throttler_check_any_total counter
vttablet_throttler_check_any_total 121
# HELP vttablet_throttler_check_mysql_self_seconds_since_healthy seconds since last healthy cehck for mysql.self
# TYPE vttablet_throttler_check_mysql_self_seconds_since_healthy gauge
vttablet_throttler_check_mysql_self_seconds_since_healthy 0
# HELP vttablet_throttler_check_mysql_shard_seconds_since_healthy seconds since last healthy cehck for mysql.shard
# TYPE vttablet_throttler_check_mysql_shard_seconds_since_healthy gauge
vttablet_throttler_check_mysql_shard_seconds_since_healthy 0
# HELP vttablet_throttler_probes_latency probes latency
# TYPE vttablet_throttler_probes_latency gauge
vttablet_throttler_probes_latency 347382
# HELP vttablet_throttler_probes_total total probes
# TYPE vttablet_throttler_probes_total counter
vttablet_throttler_probes_total 114

In the above:

everything is camel cased
I chose to (meanwhile?) not provide stats for specific apps/workflows. It can get ugly and spammy with UUIDs, and can cause bloating of /debug/vars and bloating of memory.

I confess to dislike the CamelCase approach, because in a scenario where I need more than one word to describe something, such as in the above SecondsSinceHealthy, I want that description to be atomic, rather than be split into three parts.
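
For illustration, building a CamelCase metric name out of parameterized parts such as mysql and self might look like the sketch below. The commits in this PR mention a SingleWordCamel function, but its implementation is not shown in this conversation; the helpers singleWordCamel and metricName here are hypothetical stand-ins, assuming "capitalize the first letter, lowercase the rest".

```go
package main

import (
	"fmt"
	"strings"
	"unicode"
)

// singleWordCamel is a hypothetical helper: it upper-cases the first rune
// of a single word and lower-cases the rest, e.g. "mysql" -> "Mysql".
func singleWordCamel(word string) string {
	if word == "" {
		return word
	}
	runes := []rune(word)
	return string(unicode.ToUpper(runes[0])) + strings.ToLower(string(runes[1:]))
}

// metricName joins parameterized parts into one CamelCase metric name.
func metricName(parts ...string) string {
	var b strings.Builder
	for _, part := range parts {
		b.WriteString(singleWordCamel(part))
	}
	return b.String()
}

func main() {
	fmt.Println(metricName("throttler", "aggregated", "mysql", "self"))
	fmt.Println(metricName("throttler", "aggregated", "mysql", "shard"))
}
```

Note that a multi-word part such as SecondsSinceHealthy cannot be passed through a helper like this unchanged, since everything after the first letter is lower-cased; it has to be supplied as three separate words.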


deepthi · 2022-06-10T18:00:20Z

In terms of naming, this now looks good.
I previously forgot to address another part of your comments, apologies for that, and I'll do it now.

But, also, the metrics names are parameterized; mysql, self, shard are parameterized parts. Let's say we can call them MySQL, Self, Shard because we control them -- the app names are also parameterized. gh-ost, vreplication, ...

The existing vitess names like RowStreamerMaxInnoDBTrxHistLen are constants.

I do see examples of parameterized connection pool params.

But I don't have clarity on how we might deal with parameterized app names.

The stats package allows you to attach labels to the same metric. For example,

# TYPE vttablet_query_row_counts counter
vttablet_query_row_counts{plan="Insert",table="corder"} 5
vttablet_query_row_counts{plan="Insert",table="customer"} 5
vttablet_query_row_counts{plan="Insert",table="product"} 2
vttablet_query_row_counts{plan="Select",table="corder"} 0
vttablet_query_row_counts{plan="Select",table="customer"} 0
vttablet_query_row_counts{plan="Select",table="dual"} 0
vttablet_query_row_counts{plan="Select",table="product"} 0

Labels don't have to be any particular case, as you can see from this example.
If self/shard are values of a particular property, then the property can be defined as a label, and these become the values that we emit for that label. Similarly app can be a label with certain values that we produce.
Popular monitoring tools all support querying / filtering by labels, so this approach works well for not needing to add new metrics every time we decide to support a new value for a certain property of the metric.

Does this help?
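
To sketch the label idea in code: the example below is a standalone, standard-library-only illustration of a counter keyed by label values, rendered in the same text shape as the output above. It does not use the Vitess stats package; the type labeledCounter, the metric name vttablet_throttler_check_total, and the label names app and check are all hypothetical.

```go
package main

import (
	"fmt"
	"sort"
	"strings"
	"sync"
)

// labeledCounter is a hypothetical counter keyed by a fixed set of labels.
type labeledCounter struct {
	name   string
	labels []string

	mu     sync.Mutex
	counts map[string]int64 // key: label values joined by "\x00"
}

func newLabeledCounter(name string, labels ...string) *labeledCounter {
	return &labeledCounter{name: name, labels: labels, counts: map[string]int64{}}
}

// Add increments the counter for one combination of label values.
func (c *labeledCounter) Add(delta int64, values ...string) {
	if len(values) != len(c.labels) {
		panic("number of label values must match number of labels")
	}
	c.mu.Lock()
	defer c.mu.Unlock()
	c.counts[strings.Join(values, "\x00")] += delta
}

// Render returns one line per label combination, sorted by label values.
func (c *labeledCounter) Render() []string {
	c.mu.Lock()
	defer c.mu.Unlock()
	keys := make([]string, 0, len(c.counts))
	for k := range c.counts {
		keys = append(keys, k)
	}
	sort.Strings(keys)
	lines := make([]string, 0, len(keys))
	for _, k := range keys {
		values := strings.Split(k, "\x00")
		pairs := make([]string, len(values))
		for i, v := range values {
			pairs[i] = fmt.Sprintf("%s=%q", c.labels[i], v)
		}
		lines = append(lines, fmt.Sprintf("%s{%s} %d", c.name, strings.Join(pairs, ","), c.counts[k]))
	}
	return lines
}

func main() {
	checks := newLabeledCounter("vttablet_throttler_check_total", "app", "check")
	checks.Add(1, "gh-ost", "shard")
	checks.Add(1, "vreplication", "self")
	checks.Add(2, "vreplication", "shard")
	for _, line := range checks.Render() {
		fmt.Println(line)
	}
}
```

With this shape, supporting a new app means emitting a new label value rather than registering a new metric name, though per-workflow values such as UUIDs would still grow the number of series.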

shlomi-noach · 2022-06-19T04:06:23Z

Does this help?

Yes, thank you!

systay · 2022-06-20T06:24:19Z

go/stats/counter_map.go

@@ -0,0 +1,95 @@
+/*
+Copyright 2019 The Vitess Authors.


systay · 2022-06-20T06:26:44Z

go/stats/counter_test.go

+		t.Errorf("want %#v, got %#v", v, gotv)
+	}
+	v.Set(3.14)
+	if v.Get() != 3.14 {


nit: I personally find stretchr assertions like assert.Equal(t, 3.14, v.Get()) much easier to read

You're absolutely right; I copied+pasted existing tests and kept the original code conventions (I assume this was written way before testify was in use). I'll update.

systay · 2022-06-20T06:28:07Z

go/textutil/strings_test.go

+	for _, tc := range tt {
+		t.Run(tc.word, func(t *testing.T) {
+			camel := SingleWordCamel(tc.word)
+			assert.Equal(t, tc.expect, camel)


systay · 2022-06-20T06:28:56Z

go/vt/servenv/exporter.go

@@ -307,6 +307,10 @@ func (e *Exporter) NewGauge(name string, help string) *stats.Gauge {
 	return lvar
 }

+func (e *Exporter) NewGaugeFloat64(name string, help string) *stats.GaugeFloat64 {
+	return nil


I don't follow why this is returning nil. Maybe add a comment?

See original comment, where this is explained. I'll also add as code comment.

shlomi-noach added 9 commits May 26, 2022 14:16

Tablet throttler: serve metrics on /throttler/metrics

a2a9481

Signed-off-by: Shlomi Noach <2607934+shlomi-noach@users.noreply.github.com>

release notes

1a59e95

Signed-off-by: Shlomi Noach <2607934+shlomi-noach@users.noreply.github.com>

Merge branch 'main' into throttler-metrics-in-debug-vars

f4a66ea

Signed-off-by: Shlomi Noach <2607934+shlomi-noach@users.noreply.github.com>

GaugeFloat64

b1c61af

Signed-off-by: Shlomi Noach <2607934+shlomi-noach@users.noreply.github.com>

counter map

5bf7059

Signed-off-by: Shlomi Noach <2607934+shlomi-noach@users.noreply.github.com>

GaugeFloat64

3eceb3c

Signed-off-by: Shlomi Noach <2607934+shlomi-noach@users.noreply.github.com>

remove /throttler/metrics endpoint

ef632f4

Signed-off-by: Shlomi Noach <2607934+shlomi-noach@users.noreply.github.com>

move away from rcrowley/go-metrics and into vitess's stats

6b1183b

Signed-off-by: Shlomi Noach <2607934+shlomi-noach@users.noreply.github.com>

update release notes

976b7bc

Signed-off-by: Shlomi Noach <2607934+shlomi-noach@users.noreply.github.com>

shlomi-noach added Type: Enhancement Logical improvement (somewhere between a bug and feature) Component: Cluster management release notes (needs details) This PR needs to be listed in the release notes in a dedicated section (deprecation notice, etc...) Skip Upgrade Downgrade labels Jun 6, 2022

shlomi-noach requested review from harshit-gangal, systay, deepthi and ajm188 as code owners June 6, 2022 12:22

shlomi-noach mentioned this pull request Jun 6, 2022

Tablet throttler: serve metrics on /throttler/metrics #10370

Closed

3 tasks

Merge branch 'main' into throttler-stats-in-debug-vars

90a8fca

Signed-off-by: Shlomi Noach <2607934+shlomi-noach@users.noreply.github.com>

shlomi-noach added 2 commits June 6, 2022 15:35

internal app name is 'vitess'

108dcf9

Signed-off-by: Shlomi Noach <2607934+shlomi-noach@users.noreply.github.com>

internal app name is 'vitess'

f54d5e5

Signed-off-by: Shlomi Noach <2607934+shlomi-noach@users.noreply.github.com>

shlomi-noach mentioned this pull request Jun 7, 2022

Feature Request: Tablet Throttler - retain/expose counter of ThrottleMetric for more flexible tablet throttling #10407

Closed

shlomi-noach marked this pull request as draft June 7, 2022 06:36

shlomi-noach added 3 commits June 7, 2022 14:22

WingleWordCamel

6aea651

Signed-off-by: Shlomi Noach <2607934+shlomi-noach@users.noreply.github.com>

SingleWordCamel tests

06f7353

Signed-off-by: Shlomi Noach <2607934+shlomi-noach@users.noreply.github.com>

CamelCase for /debug/var metric names. Do not include app-specific me…

1f6abc5

…trics Signed-off-by: Shlomi Noach <2607934+shlomi-noach@users.noreply.github.com>

shlomi-noach marked this pull request as ready for review June 7, 2022 11:26

Merge branch 'main' into throttler-stats-in-debug-vars

ed1f813

Signed-off-by: Shlomi Noach <2607934+shlomi-noach@users.noreply.github.com>

shlomi-noach added release notes and removed release notes (needs details) This PR needs to be listed in the release notes in a dedicated section (deprecation notice, etc...) labels Jun 19, 2022

systay reviewed Jun 20, 2022

systay approved these changes Jun 20, 2022

shlomi-noach added 3 commits June 20, 2022 11:04

use testify

9fc4561

Signed-off-by: Shlomi Noach <2607934+shlomi-noach@users.noreply.github.com>

copyright

7320d94

Signed-off-by: Shlomi Noach <2607934+shlomi-noach@users.noreply.github.com>

some code comments

05101f0

Signed-off-by: Shlomi Noach <2607934+shlomi-noach@users.noreply.github.com>

shlomi-noach merged commit bddc71e into vitessio:main Jun 22, 2022

shlomi-noach deleted the throttler-stats-in-debug-vars branch June 22, 2022 06:38

deepthi mentioned this pull request Jun 23, 2022

[14.0] Update and prepare the v14.0.0 summary #10569

Merged

3 tasks

shlomi-noach mentioned this pull request Jun 23, 2022

Throttler stats: amendment #10572

Merged

3 tasks
