Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Add support for exporting metrics to Prometheus. #3784

Merged
merged 14 commits into from Apr 21, 2018

Conversation

@zmagg
Copy link
Collaborator

@zmagg zmagg commented Mar 28, 2018

Description

WIP PR to export metrics to Prometheus, as discussed in #3644

I've gotten feedback from @demmer on the approach in this PR, but I'd like to get some feedback from others as well, especially around metric names and the API exposed. Despite the long list of changes below, there are no breaking changes in this PR for anybody that relies on the expvar stats.

Metrics backend changes / naming stuff

  • Adds pull_backend.go which supports registering generic pull backends which can scrape for metrics.
  • Adds a Prometheus backend that includes custom collectors per metric type and creates constant, throwaway, metrics "on the fly" at scrape time. The Prometheus backend is set up to be configured as a plugin per Vitess component.
  • The Prometheus backend also programmatically converts the camel case metric names to snake case metric names at scrape time, to align with the Prometheus naming conventions. This includes special casing names that don't convert easily (like VSchema) (See: toSnake()) .
    Right now, it also adds the name of the component as a namespace to the metric, but that produces some metric names that have duplicate prefixes, as we already prefix a lot of the Vitess metric names with the component type (e.g. VtgateVschemaCounts turns into vtgate_vtgate_vschema_count)

Instrumentation changes

  • Adds Gauge-style types (Gauge, GaugeWithLabels, etc.) Prometheus expects Gauges and Counters to be separate types and we need this information in order to specify what kind of metric we're exporting.

  • Renamed all metric types. Mapping of rename:
    Int => Counter/Gauge (depending)
    IntFunc => GaugeFunc
    Counters => CountersWithLabels / GaugesWithLabels(depending)
    MultiCounters => CountersWithMultiLabels /GaugesWithMultiLabels (depending)
    MultiCounterFunc => CountersFuncWithMultiLabels

  • Adds a description string per metric.

  • Adds an explicit labelName for the tags in MultiCounters. Prometheus's labels are all key/value pairs.

  • Changes calls to stats.Publish(<IntFunc>/<DurationFunc>) to call stats.NewIntFunc/stats.NewDurationFunc explicitly instead.

TODO still

  • Add tests for each of the types of metrics
  • Add support for DurationFunc
  • Move registration of promhttp into servenv
  • After #3816 is merged, modify it to support toSnakeCase and a few replacers instead.
  • Get TravisCI working with the new packages I've added to the vendor file (promhttp)
  • Add a prombackend plugin for each component with metrics to be exported
  • It looks like many of the Counters/MultiCounters/MultiCounterFuncs are actually Gauges in practice. Since Prometheus differentiates between these two types, we need to go through and swap some of these out, where appropriate.
  • Dedupe the metric name with the namespace. e.g. vtgate_vtgate_vschema_counts should just be vtgate_vschema_counts

Metric types not exported

  • NewString / NewStringMap / PublishJSONFunc
    Prom. doesn't handle string values. Some of these could be converted into different types, if we'd like to export the metrics anyway.

  • NewFloat
    This isn't called anywhere today. We could maybe export them anyway in case there are any callers in the future.

  • direct calls to stats.Publish(...CounterFunc())
    I think this should get ported in a followup PR, possibly refactoring it to use MultiCounterFunc.

Additional things to do in follow-on PRs

  • Visit Rates / DurationFunc callsites and instrument them so that we can derive the rate/duration in Prometheus at querytime.

This change is Reviewable

@zmagg zmagg changed the title WIP PR to add support for exporting metrics to Prometheus. Add support for exporting metrics to Prometheus. Mar 28, 2018
publish(name, t)
}

publishPullMultiGauges(t, name)
Copy link
Member

@sougou sougou Apr 2, 2018

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think you should be able to do this by registering a new var hook using stats.Register, and have that function make these calls. This way, there will be no need to spray these additional publish calls through all the types.

Loading

@@ -229,6 +231,53 @@ func (v *Int) String() string {
return strconv.FormatInt(v.i.Get(), 10)
}

// Help returns the help string
func (v *Int) Help() string {
Copy link
Member

@sougou sougou Apr 2, 2018

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

For all these simple types Int is same as IntGauge. We might as well have only one of them. Maybe we can just rename Int to IntGauge.

Loading


// String is the implementation of expvar.var
func (f IntFunc) String() string {
return strconv.FormatInt(f(), 10)
func (intFunc IntFunc) String() string {
Copy link
Member

@sougou sougou Apr 2, 2018

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Go style recommends using single letter names for simple types like this. I don't necessarily agree with it, but it's better to follow the guidelines :).

Loading

func Init(namespace string) {
http.Handle("/metrics", promhttp.Handler())
promBackend := &PromBackend{namespace: namespace}
stats.RegisterPullBackendImpl("prom", promBackend)
Copy link
Member

@sougou sougou Apr 2, 2018

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This should just call stats.Register with that new function that can call the various pull functions.

Loading

return output
}

func toSnake(name string) string {
Copy link
Member

@sougou sougou Apr 2, 2018

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@michael-berlin is there a way you can export youtube's function that converts to snake-case? I'm assuming that it shouldn't have anything proprietary.

Loading

Copy link
Collaborator Author

@zmagg zmagg Apr 9, 2018

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Moving the snake case / kebab-case conversation here:

(From @michael-berlin in Reviewable:) Turns out snake_case uses underscores but internally we're doing kebab-case (with hyphens).
I've exported the code: #3816
But I'm not sure if it's worth it merging it.
Also note that we do not have special handling for certain words e.g. the resulting internal variable name is vtgate-v-schema-counts :)

Ooh. Thanks for exporting!

@sougou Thoughts here? I can also just change my toSnakeCase to follow what toKebabCase is doing with the regexp package (which looks significantly cleaner than my hack with runes), and add a test.

I would also prefer to do some special casing (For things like VSchema, VtGate) as part of the Prometheus export.

Loading

for i := range cutoffs {
key := float64(cutoffs[i]) / 1000000000
//TODO(zmagg): int64 => uint64 conversion. error if it overflows?
output[key] = uint64(buckets[i]) + last
Copy link
Member

@demmer demmer Apr 4, 2018

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It seems to me the buckets should just be uint64 from the start.

Loading

Copy link
Collaborator Author

@zmagg zmagg Apr 5, 2018

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Hmm. They're sync2.AtomicInt64s in histogram.go.

Loading

Copy link
Member

@demmer demmer Apr 5, 2018

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think that's wrong too :)

Of course the "right" thing seems to me that we add a sync2.AtomicUint64 and change all the histogram metrics to use that so we don't have to do this dance here.

@sougou any opinion here?

Loading

Copy link
Member

@sougou sougou Apr 5, 2018

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Agree that float64 should not be used as key because it's not an exact type. However, I recommend using sync2.AtomicInt64.

Basically, unsigned ints should only be used to represent non-numbers, like bitmaps. This is actually a coding standard within google, and I agree with it based on my own experience.
Using unsigned has the same problems of a const char *. It eventually poisons the entire code base.

Having said that, we have a few embarrassing places where uin64 is used as a number. It's mostly contained for now. Someday, we'll fix it :).

Loading

Copy link
Member

@demmer demmer Apr 6, 2018

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Interesting. Well ok then, I guess we stick with signed ints and this casting since the prom API wants unsigned ints for bucket levels (which does still seem more logical to me).

Loading

@zmagg zmagg force-pushed the prom-metrics branch 2 times, most recently from 2648606 to 53195d7 Apr 5, 2018
@zmagg
Copy link
Collaborator Author

@zmagg zmagg commented Apr 5, 2018

Updated description to account for new implementation changes and new TODOs. : )

Loading

@@ -0,0 +1,27 @@
package stats

// PullBackend should be implemented to export pull metrics from Vitess.
Copy link
Member

@demmer demmer Apr 5, 2018

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

With the new change to just use the stats.Publish hook I actually don't think we need this interface any more since it looks like we never use it anywhere.

That also would let all the prom registration functions be private to the package instead of public.

Loading


// metricsCollector collects both stats.Counters and stats.Gauges
type metricsCollector struct {
counters map[*stats.Counters]*prom.Desc
Copy link
Member

@demmer demmer Apr 5, 2018

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

From what I can tell reading below, it looks like we never actually put more than one stats.Counters into any given metricsCollector, so can this be simplified to just have a pointer to the counter and a pointer to the Desc?

Loading

@zmagg zmagg force-pushed the prom-metrics branch from b7d2c26 to 667f2ce Apr 5, 2018
Copy link
Member

@demmer demmer left a comment

This is looking great!
A large number of comments above are mostly about the Counter / Gauge distinction but overall this is getting close.

@sougou Can you take a look as well?

Loading

connAccept = stats.NewInt("MysqlServerConnAccepted")
connSlow = stats.NewInt("MysqlServerConnSlow")
timings = stats.NewTimings("MysqlServerTimings", "MySQL server timings")
connCount = stats.NewCounter("MysqlServerConnCount", "Connection count for MySQL servers")
Copy link
Member

@demmer demmer Apr 6, 2018

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

connCount is a gauge since it is the "Active mysql server connections", while connAccept is the "Count of accepted mysql server connections.

IMO it would be clearer if we renamed the vars themselves to connsActive and connsAccepted to make this clearer.

Loading

Copy link
Collaborator Author

@zmagg zmagg Apr 11, 2018

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

(Changed the help string but I didn't rename the vars.)

Loading

timings = stats.NewTimings("MysqlServerTimings", "MySQL server timings")
connCount = stats.NewCounter("MysqlServerConnCount", "Connection count for MySQL servers")
connAccept = stats.NewCounter("MysqlServerConnAccepted", "Connections accepted by MySQL server")
connSlow = stats.NewCounter("MysqlServerConnSlow", "Slow MySQL server connections")
Copy link
Member

@demmer demmer Apr 6, 2018

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

A better description would be "Count of connections that took more than the configured mysql_slow_connect_warn_threshold to establish.

Loading

ConnCount: stats.NewInt(countTag),
ConnAccept: stats.NewInt(acceptTag),
ConnCount: stats.NewCounter(countTag, "Connection count inside net.Listener"),
ConnAccept: stats.NewCounter(acceptTag, "Connections accepted inside net.Listener"),
Copy link
Member

@demmer demmer Apr 6, 2018

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

ConnCount is also a Gauge. I'd recommend the same help string as with the mysql protocol case.

Loading

return v
}

// Add adds the provided value to the Int
Copy link
Member

@demmer demmer Apr 6, 2018

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Nit -- replace Int with Counter

Loading

// Reset resets all counter values
func (c *Counters) Reset() {
c.mu.Lock()
defer c.mu.Unlock()
c.counts = make(map[string]*int64)
}

// ResetCounter resets a specific counter value to 0
func (c *Counters) ResetCounter(name string) {
Copy link
Member

@demmer demmer Apr 6, 2018

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

For consistency, I would change this function to be Reset(name string) and the above one ResetAll().

Loading

UserTableQueryCount = stats.NewMultiCounters("UserTableQueryCount", []string{"TableName", "CallerID", "Type"})
UserTableQueryCount = stats.NewCountersWithMultiLabels(
"UserTableQueryCount",
"Number of queries received for each CallerID/table comb",
Copy link
Member

@demmer demmer Apr 6, 2018

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

"comb"=> "combination"

Loading

UserTableQueryTimesNs = stats.NewMultiCounters("UserTableQueryTimesNs", []string{"TableName", "CallerID", "Type"})
UserTableQueryTimesNs = stats.NewCountersWithMultiLabels(
"UserTableQueryTimesNs",
"Shows total latency for each CallerID/table combo",
Copy link
Member

@demmer demmer Apr 6, 2018

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Again I think drop "Shows" and change "combo" to "combination"

Loading

@@ -238,7 +238,7 @@ func NewTabletServer(config tabletenv.TabletConfig, topoServer *topo.Server, ali
// So that vtcombo doesn't even call it once, on the first tablet.
// And we can remove the tsOnce variable.
tsOnce.Do(func() {
stats.Publish("TabletState", stats.IntFunc(func() int64 {
stats.NewGaugeFunc("TabletState", "Tablet server's state", (func() int64 {
Copy link
Member

@demmer demmer Apr 6, 2018

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

"Tablet server state"

Loading

queueExceeded = stats.NewCounters("TxSerializerQueueExceeded")
queueExceeded = stats.NewCountersWithLabels(
"TxSerializerQueueExceeded",
"Number of transactions that were rejcted because the max queue size per row range was exceeded",
Copy link
Member

@demmer demmer Apr 6, 2018

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

"Rejected"

Loading

// statsStateDurations tracks for each state how much time was spent in it. Mainly used for testing.
statsStateDurationsNs = stats.NewCounters("WorkerStateDurations")
statsStateDurationsNs = stats.NewGaugesWithLabels("WorkerStateDurations", "How much time was spent in each state", "state")
Copy link
Member

@demmer demmer Apr 6, 2018

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Why not a Timer?

Loading

Copy link
Collaborator Author

@zmagg zmagg Apr 11, 2018

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Hmm. It's a good idea. Followup PR? I'd like to avoid breaking more in this one. : )

Loading

@michael-berlin
Copy link
Contributor

@michael-berlin michael-berlin commented Apr 9, 2018

Reviewed 5 of 43 files at r2, 3 of 19 files at r3, 3 of 12 files at r4.
Review status: 11 of 62 files reviewed at latest revision, 37 unresolved discussions.


go/cmd/vtgate/plugin_prombackend.go, line 27 at r4 (raw file):

func init() {
	prombackend.Init("vtgate")

Is it necessary to have a different namespace for each binary?

If so, what about other binaries which we have e.g. vtctld and vtworker?

Does Prometheus require that? Its internal Google counterpart doesn't care and instead can distinguish binaries by their production job name ;)


go/stats/export.go, line 266 at r4 (raw file):

// IntFunc converts a function that returns
// an int64 as an expvar.
type IntFunc func() int64

As @demmer pointed out before, you'll probably need a CounterFunc which will be the renamed version of this.

Note that the removal/rename of this will break the Google internal plugins. Given that, please give me or @alainjobart a head's up before merging this. This way, we can change all internal callers first and verify that they won't break due to this change.


go/stats/prombackend/prombackend.go, line 144 at r1 (raw file):

Previously, sougou (Sugu Sougoumarane) wrote…

@michael-berlin is there a way you can export youtube's function that converts to snake-case? I'm assuming that it shouldn't have anything proprietary.

Turns out snake_case uses underscores but internally we're doing kebab-case (with hyphens).

I've exported the code: #3816

But I'm not sure if it's worth it merging it.

Also note that we do not have special handling for certain words e.g. the resulting internal variable name is vtgate-v-schema-counts :)


go/stats/prombackend/prombackend.go, line 11 at r4 (raw file):
https://github.com/golang/go/wiki/CodeReviewComments#imports

Avoid renaming imports except to avoid a name collision; good package names should not require renaming.

Given this style guide recommendation, you shouldn't rename the import to "prom".

(In general, I personally don't like abbreviations (unless it's a local variable or a Go receiver name). E.g. "prombackend" is less descriptive than "prometheus_backend". Similar comment for the other names e.g. "PublishPromMetric".)


go/vt/vtgate/masterbuffer/masterbuffer.go, line 50 at r4 (raw file):

	fakeBufferDelay        = flag.Duration("fake_buffer_delay", 1*time.Second, "The amount of time that we should delay all master requests for, to fake a buffer.")

	bufferedRequestsAttempted  = stats.NewCounter("BufferedRequestsAttempted", "Count of buffered requests attempted")

Head's up: This is dead code which I'm going to delete here: #3814


go/vt/worker/worker.go, line 63 at r4 (raw file):

	// statsRetryCount is the total number of times a query to vttablet had to be retried.
	statsRetryCount = stats.NewCounter("WorkerRetryCount", "Total number of times a query to a vttablet had to be retried")
	// statsRetryCount groups the number of retries by category e.g. "TimeoutError" or "Readonly".

Can you please delete the comments since you copied them into the description now?

Please make sure that no information gets lost e.g. the description for this stats var misses the examples e.g. "TimeoutError" or "Readonly".


Comments from Reviewable

Loading

@zmagg
Copy link
Collaborator Author

@zmagg zmagg commented Apr 9, 2018

@michael-berlin

Replying here in a few comments, as I can't respond in Reviewable for now. (I need to check with some internal people about the permissions that responding in Reviewable requires (it asked for write perms to all my orgs...?))

Small stuff first:

Is it necessary to have a different namespace for each binary?
If so, what about other binaries which we have e.g. vtctld and vtworker?
Does Prometheus require that? Its internal Google counterpart doesn't care and instead can distinguish binaries by their production job name ;)

@demmer and I talked about this a while back, but I forget why we landed on this. Prometheus does let you specify a namespace per binary in the static scrape config, which we could use instead of putting the namespace here. @demmer, any new thoughts?

Avoid renaming imports except to avoid a name collision; good package names should not require renaming.
Given this style guide recommendation, you shouldn't rename the import to "prom".
(In general, I personally don't like abbreviations (unless it's a local variable or a Go receiver name). E.g. "prombackend" is less descriptive than "prometheus_backend". Similar comment for the other names e.g. "PublishPromMetric".)

That sounds good actually, I'll do a rename pass.

Head's up: This is dead code which I'm going to delete here: #3814

Thanks for the head's up : )

Can you please delete the comments since you copied them into the description now?
Please make sure that no information gets lost e.g. the description for this stats var misses the examples e.g. "TimeoutError" or "Readonly".

Will do!

Loading

@@ -261,15 +228,6 @@ func (v *Duration) String() string {
return strconv.FormatInt(int64(v.i.Get()), 10)
}

// IntFunc converts a function that returns
Copy link
Collaborator Author

@zmagg zmagg Apr 9, 2018

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

(From @michael-berlin in Reviewable): As @demmer pointed out before, you'll probably need a CounterFunc which will be the renamed version of this.
Note that the removal/rename of this will break the Google internal plugins. Given that, please give me or @alainjobart a head's up before merging this. This way, we can change all internal callers first and verify that they won't break due to this change.

Yes! It's changed to a GaugeFunc in a9d7fef
I think we'll also need a CounterFunc, for the ones that are Counters and not Gauges.

re: breaking the Google internal plugins

I'd love to understand how the Google internal plugins use this so that I can avoid breaking it. Are you using the expvar hook or some other way of exporting and converting the expvars to the internal system?

Is it the removal here that is likely to break (So, here, we went from IntFunc as a parameter to stats.Publish to explicitly creating a new GaugeFunc), or is it the any/all metric type renames? If it's the latter, head's up that I actually renamed almost all of the metric names. Some of these were to more clearly indicate the Gauge/Counter difference. (Full map in PR description).

Will that require a lot of work on y'all's part to update the plugins internally?

Loading

Copy link
Contributor

@michael-berlin michael-berlin Apr 9, 2018

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Internal code basically just adds additional stats variables.

Examples:

stats.Publish("XXX", stats.IntFunc(<some method>))

stats.Publish("XXX", stats.FloatFunc(func() float64 { return float64(<some method>) }))

timings:   stats.NewMultiTimings("VtgateXxxApi", []string{"Operation", "Keyspace", "DbType"}),

errorCounters = stats.NewCounters("XXXErrors", "A", "B")

requests = stats.NewMultiCounters("XXX", []string{"A", "B", "C"})

As long as it's easy to rewrite this code, it won't be a lot of work and fine.

Loading

)

func init() {
prombackend.Init("vtgate")
Copy link
Contributor

@michael-berlin michael-berlin Apr 9, 2018

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Starting a thread here instead of using Reviewable:

I wrote:

Is it necessary to have a different namespace for each binary?
If so, what about other binaries which we have e.g. vtctld and vtworker?
Does Prometheus require that? Its internal Google counterpart doesn't care and instead can distinguish binaries by their production job name ;)

zmagg@ responded:

@demmer and I talked about this a while back, but I forget why we landed on this. Prometheus does let you specify a namespace per binary in the static scrape config, which we could use instead of putting the namespace here. @demmer, any new thoughts?

The plugin here actually does two things:

  1. Install the "/metrics" handler.
  2. Set the namespace.

Would it make sense to move 1. to our servenv instead and install the handler by default? What are the downsides of having the handler in there by default?

re: 2. If setting the namespace is a crucial aspect, I would prefer to do it in code. This way it's more reusable. Otherwise, other people will ask you for this static map config ;) If you set the namespace in code, my comment is that you should set the namespace for all binaries, not just vtgate and vttablet.

Loading

Copy link
Collaborator Author

@zmagg zmagg Apr 11, 2018

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Oh interesting.

  1. I think moving it into servenv makes sense, location-wise. The thought here was that we wanted to be able to not have any Prometheus-stuff running if a build didn't want it, but I think we can handle that inside servenv with a flag.
  2. Yeah, setting the namespace is crucial. I've added one for the vtworker and vtctl binaries now, are there any that I missed that metrics are in too? Those are all the binaries that we use at Slack.

Loading

@zmagg
Copy link
Collaborator Author

@zmagg zmagg commented Apr 11, 2018

Updated the PR description for what's left.

Commit I just pushed included responses to all the review comments except:

  • Support for DurationFunc
  • Moving the registration of the promhttp handler into servenv

Also, I still need to add unit tests, and we need to resolve what to do about this toSnakeCase thing.

Loading

@zmagg
Copy link
Collaborator Author

@zmagg zmagg commented Apr 13, 2018

@demmer Can you take another look? The PR description is up to date with my latest changes. : )

One thing I'm considering adding is preventing Counter-type metrics from calling Add() with negative numbers. Thoughts?

Loading

Copy link
Member

@demmer demmer left a comment

This is looking really really close!

Most of these comments relate to description nit picks.

The only substantive issue relates to the need for Counter/Gauge variants of DurationFunc.

Also as you I'm sure noticed, there are merge conflicts and test / CodeClimate cleanup.

Loading


func init() {
servenv.OnRun(func() {
http.Handle("/metrics", promhttp.Handler())
Copy link
Member

@demmer demmer Apr 13, 2018

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Rather than make each of the plugins pull in the promhttp details and register the handler, why not move this inside prometheusbackend.Init.

Loading

connSlow = stats.NewInt("MysqlServerConnSlow")
timings = stats.NewTimings("MysqlServerTimings", "MySQL server timings")
connCount = stats.NewGauge("MysqlServerConnCount", "Active MySQL server connections")
connAccept = stats.NewCounter("MysqlServerConnAccepted", "Count of connections accepted by MySQL server")
Copy link
Member

@demmer demmer Apr 13, 2018

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Admittedly this is taste, but I find the use of "Count of" to be unnecessary in this and other descriptions. Ideally we should be consistent one way or the other and either always use it or never use it.

Personally I find "Accepted MySQL server connections" to be sufficiently descriptive and more consistent with the gauge case (where we don't say "Gauge of active MySQL server connections").

Loading

@@ -37,8 +38,8 @@ type countingConnection struct {
func Published(l net.Listener, countTag, acceptTag string) net.Listener {
return &CountingListener{
Listener: l,
ConnCount: stats.NewInt(countTag),
ConnAccept: stats.NewInt(acceptTag),
ConnCount: stats.NewGauge(countTag, "Active connections accepted by counting listener"),
Copy link
Member

@demmer demmer Apr 13, 2018

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It has always been confusing to me that ConnCount and ConnAccepted don't have more descriptive names, and these generic descriptions don't help either :)

While I'm happy to defer this to a subsequent improvement, IMO we should change the proc.Listen API to include a description string and pass that here as well along with the countTag.

Then these would be:

ConnCount: stats.NewGauge(countTag, fmt.Sprintf("Active %s connections", description)),
ConnAccept: stats.NewCounter(countTag, fmt.Sprintf("Accepted %s connections", description)),

Since these are only used for HTTP, we would then change servenv.Run to include "HTTP" as the description string.

I also think we would be well served to rename the generic metric names as well to be HTTPConnCount and HTTPConnAccepted but that would break backwards compatibility. @sougou / @michael-berlin thoughts on this?

Loading

Copy link
Collaborator Author

@zmagg zmagg Apr 18, 2018

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Agreed, but let's do it in a follow-on?

Loading

Copy link
Member

@demmer demmer Apr 19, 2018

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yup. Would still like to hear from @sougou and @michael-berlin about the backwards-incompatibility of this proposed change.

Loading

}

// Add adds the provided value to the Counter
func (v *Counter) Add(delta int64) {
Copy link
Member

@demmer demmer Apr 13, 2018

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

FWIW the prom API includes an Inc() method which seems nicer than Add(1), especially if we follow through with the idea to prevent Add of a negative number.

Loading

Copy link
Collaborator Author

@zmagg zmagg Apr 18, 2018

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Added logging for now!

Loading

ch <- prometheus.MustNewConstMetric(c.desc, prometheus.CounterValue, float64(c.df.F()/1000000000.0))
}

func toPromValueType(vt ValueType) prometheus.ValueType {
Copy link
Member

@demmer demmer Apr 13, 2018

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Why do we need this conversion and the ValueType enum at all? It seems like holdover from the more generic pullbackend design, and with this current approach we can avoid the conversion by just passing the appropriate prometheus metric type along when we instantiate the collector.

Loading

stats.NewGaugeFunc(name+"Active", "Tablet server conn pool active", cp.Active)
stats.NewGaugeFunc(name+"InUse", "Tablet server conn pool in use", cp.InUse)
stats.NewGaugeFunc(name+"MaxCap", "Tablet server conn pool max cap", cp.MaxCap)
stats.NewGaugeFunc(name+"WaitCount", "Tablet server conn pool wait count", cp.WaitCount)
Copy link
Member

@demmer demmer Apr 13, 2018

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Same comment applies :)

Loading

stats.NewGaugeFunc("WarnResultSize", "Query engine warn result size", qe.warnResultSize.Get)
stats.NewGaugeFunc("MaxDMLRows", "Query engine max DML rows", qe.maxDMLRows.Get)
stats.NewGaugeFunc("StreamBufferSize", "Query engine stream buffer size", qe.streamBufferSize.Get)
stats.NewGaugeFunc("TableACLExemptCount", "Query engine table ACL exempt count", qe.tableaclExemptCount.Get)
Copy link
Member

@demmer demmer Apr 13, 2018

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

TableACLExemptCount is a Counter.

Loading

stats.NewGaugeFunc("QueryCacheLength", "Query engine query cache length", qe.plans.Length)
stats.NewGaugeFunc("QueryCacheSize", "Query engine query cache size", qe.plans.Size)
stats.NewGaugeFunc("QueryCacheCapacity", "Query engine query cache capacity", qe.plans.Capacity)
stats.NewGaugeFunc("QueryCacheEvictions", "Query engine query cache evictions", qe.plans.Evictions)
Copy link
Member

@demmer demmer Apr 13, 2018

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

QueryCacheEvictions is a counter

Loading

// QPSRates shows the qps of QueryStats. Sample every 5 seconds and keep samples for up to 15 mins.
QPSRates = stats.NewRates("QPS", QueryStats, 15*60/5, 5*time.Second)
// WaitStats shows the time histogram for wait operations
WaitStats = stats.NewTimings("Waits")
WaitStats = stats.NewTimings("Waits", "Shows the time histogram for wait operations")
Copy link
Member

@demmer demmer Apr 13, 2018

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

More description nits: "Shows the..." should be just "Wait operations" I think

Loading

"Errors",
"Number of critical errors that happened",
Copy link
Member

@demmer demmer Apr 13, 2018

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think:

"Critical errors"
"Internal component errors"
...

Loading

@zmagg
Copy link
Collaborator Author

@zmagg zmagg commented Apr 14, 2018

Ok!

Responded to your comments.

Major changes:

  • Changed up CounterFunc & GaugeFunc to take in a new type: MetricFunc that allows us to both export ints (like it did before) and now also durations.

I did a best effort pass through of which of the DurationFuncs were gauges/counters, but would love an extra eye on that @demmer.

Also rebased, made all the other changes I 👍'd to your comments on.

What's left:

  • Now that the toKebabCase 🍢 is merged, I'll take a look next week when I'm back to make the toSnake modifications we talked about in person.
  • As discussed in person, add logging / increment a counter if we ever call Add() with a negative number on a Counter. In future PRs, we can maybe ban such things once we're confident we've removed all existing cases from the codebase.

Loading

@zmagg zmagg force-pushed the prom-metrics branch 2 times, most recently from 3ca8b93 to e298278 Apr 18, 2018
// For systems (like Prometheus) that ask for snake-case names.
// It converts CamelCase to camel_case, and CAMEL_CASE to camel_case.
// For numbers, it converts 0.5 to v0_5.
func toSnakeCase(name string) (hyphenated string) {
Copy link
Collaborator Author

@zmagg zmagg Apr 18, 2018

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I added the toSnakeCase modifications based on the toKebabCase stuff. I'm not totally convinced it structurally makes sense to put it here (at minimum, I should rename the file if it otherwise makes sense to you).

Also, it shares a memoizer with the toKebabCase, which should be fine (one build should only use one or the other), but perhaps they should be separate in the code too?

@demmer, thoughts?

Loading

Copy link
Member

@demmer demmer Apr 19, 2018

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

For safety and cleanliness I actually don't think they should share a memoizer.

Given that... I don't think there's actually any substantive shared code any more, so rather than misleadingly sharing the file name, I'd say you should just create a snake_case_converter.go with the new code and a separate memoizer.

Loading

@zmagg
Copy link
Collaborator Author

@zmagg zmagg commented Apr 18, 2018

@demmer
ok, finished up the rest of the items, although I have a few questions about where to put the toSnakeCase stuff structurally (commented in the code here: e298278#r182589219 )

Thoughts now?

Loading

@demmer
Copy link
Member

@demmer demmer commented Apr 19, 2018

From my standpoint the only remaining item before I think this is good to merge is the minor reorganization of where the snake conversion goes.

@michael-berlin / @alainjobart since this will entail some breaking changes with internal Google importing I would like a 👍 from one of you that this is ok to merge.

@sougou told me in DM he doesn't care, so he doesn't get a vote on this one.

Loading

@sougou
Copy link
Member

@sougou sougou commented Apr 19, 2018

I do care, but I know it's in good hands with @michael-berlin and @alainjobart :)

Loading

// Add adds the provided value to the Counter
func (v *Counter) Add(delta int64) {
if delta < 0 {
log.Warningf("Adding a negative value to a counter, %v should be a gauge instead", v)
Copy link
Contributor

@michael-berlin michael-berlin Apr 19, 2018

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

How likely is it that this will get called?

If it's not likely at all because you already checked all callers in open-source, then I'm okay with it as is.

Otherwise, it may be a good idea to use a throttled logger here instead. This way we would avoid that bad code spams the log, makes the disk full and potentially slows down the process.

Loading

Copy link
Collaborator Author

@zmagg zmagg Apr 19, 2018

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Oh, that's a good idea. I've checked all the easy sites (Add(-1)) but there's a lot more that wind their way through the code that I don't have confidence on, hence logging for now instead of banning decrementing all together. I switched to using a throttled logger here. I also switched to using a throttled logger for the unsupported export type logger in prometheusbackend.go, thanks for the suggestion!

Loading

@michael-berlin
Copy link
Contributor

@michael-berlin michael-berlin commented Apr 19, 2018

It looks like this is good to merge? If so, let's do as follows:

  1. I'll try to import it locally and make all necessary changes.
  2. If thinks are working, I'll submit the internal import.
  3. Shortly after that, I'll merge this PR.

Let me try to get to this today or tomorrow.

Loading

@zmagg
Copy link
Collaborator Author

@zmagg zmagg commented Apr 19, 2018

@michael-berlin That sounds awesome!

Let me figure out what's going on with these failing TravisCI tests after my last commit. The tests still succeed for me locally, so not sure. Debugging now.

Loading

// Counters is similar to expvar.Map, except that
// it doesn't allow floats. In addition, it provides
// a Counts method which can be used for tracking rates.
// it doesn't allow floats. It is used to build CountersWithLabels and GaugesWithLabels.
type Counters struct {
Copy link
Contributor

@michael-berlin michael-berlin Apr 20, 2018

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Is it actually necessary to export this?

I saw that you're using it in the promstatsbackend but the usage was not immediately clear to me.

I'm asking because we have several type switches in the code base and I made the mistake of using "stats.Counter" instead of "stats.CountersWithLabels" in these type switches.

Loading

Copy link
Collaborator Author

@zmagg zmagg Apr 20, 2018

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yeah I agree, it doesn't make sense. Let me subsume what it's doing into CountersWithLabels & GaugesWithLabels.

Loading

Copy link
Collaborator Author

@zmagg zmagg Apr 20, 2018

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Done!

Loading


// CountersWithLabels provides a labelName for the tagged values in Counters
// It provides a Counts method which can be used for tracking rates.
type CountersWithLabels struct {
Copy link
Contributor

@michael-berlin michael-berlin Apr 20, 2018

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Technically, this is only a single label and not multiple labels.

i.e. it should be named CountersWithLabel ?

Loading

Copy link
Collaborator Author

@zmagg zmagg Apr 20, 2018

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Hmm. I don't have strong feelings about this but the plural-Labels was chosen because that's what the Prometheus golang client uses for naming, and it is multiple label values, if not multiple label keys.

Loading

Copy link
Contributor

@michael-berlin michael-berlin Apr 20, 2018

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Okay. Let's keep it as is for now.

Loading

}

func TestGaugeFunc(t *testing.T) {
var gotname string
Copy link
Contributor

@michael-berlin michael-berlin Apr 20, 2018

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

These two variables are unused.

Same for TestDurationFunc below.

Also: The Register() here seems to be unnecessary as well?

Can you please double-check this file and report back? Thanks!

Loading

Copy link
Collaborator Author

@zmagg zmagg Apr 20, 2018

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Oh thanks for the catch. I added tests against gotname and gotv for GaugeFunc and DurationFunc.

Loading

@michael-berlin
Copy link
Contributor

@michael-berlin michael-berlin commented Apr 20, 2018

I have an internal import open with all the necessary changes to our internal code. Tests are passing now.

Can you please look at my last comments and address them in additional CLs? Once that's done, I'll patch them into my pending import as well and then we can wrap things up.

Loading

@zmagg
Copy link
Collaborator Author

@zmagg zmagg commented Apr 20, 2018

@michael-berlin Yay, awesome! I responded to your comments and rebased to resolve the merge conflict.

Loading

@michael-berlin
Copy link
Contributor

@michael-berlin michael-berlin commented Apr 20, 2018

I ran into the merge conflict during the import as well. Therefore, I just reverted that PR.

Given that, can you please rebase again against the latest master?

Loading

zmagg added 13 commits Apr 20, 2018
Signed-off-by: Maggie Zhou <mzhou@slack-corp.com>
* Fix the nanoseconds => ints for histogram buckets.
* Gigantic rename of metrics:

-Int => Counter
-IntGauge => Gauge
-IntFunc => GaugeFunc
-Counters => CountersWithLabels
-MultiCounters => CountersWithMultiLabels
-MultiCounterFunc => CountersFuncWithMultiLabels
-Gauges => GaugesWithLabels
-MultiGauges=> GaugesWithMultiLabels

* Add an explicit labelName for CountersWithLabels & GaugesWithLabels

Signed-off-by: Maggie Zhou <mzhou@slack-corp.com>
pull_backend.go stuff into prombackend/prombackend.go's Register call
and make the pull_backend.go an interface only.

Also, fix some more types from the gigantic refactor, and run
GoLint/GoVet

Signed-off-by: Maggie Zhou <mzhou@slack-corp.com>
Fix tests to use the new metric names.
Fix tests that expected gauges and not counters. (more to come on this
thing, but these were the tests that indicated gauges early)
Fix the `Set()` implementation for GaugesWithMultiLabels.

Signed-off-by: Maggie Zhou <mzhou@slack-corp.com>
Add plugins for each of the components.

Rename prombackend => prometheusbackend
Rename publishPromMetric to publishPrometheusMetric

Rename Reset() => ResetAll()
ResetCounter => Reset()

Rename various Counters => Gauges per Mike's helpful pointers.
Rename a few help strings.
Remove the comments in go/vt/worker/worker.go metrics and move them 100% into help functions.
Add a GaugeFuncWithMultiLabels
Add a CounterFunc

Signed-off-by: Maggie Zhou <mzhou@slack-corp.com>
Convert Nanoseconds to seconds

Call Init (handler) inside servenv.OnRun()

Add unit tests.

For counter, it doesn't make sense to have a function called ResetAll
and not one called Reset so re-rename that.

Signed-off-by: Maggie Zhou <mzhou@slack-corp.com>
…icFunc` that allows us to both export ints (like it did before) and now also durations.

- Moved Prom specific stuff in the plugins to the prometheus backend
itself.
- A ton of stats help string fix ups.
- some counters are gauges, and some gauges are counters! Fix those.

Signed-off-by: Maggie Zhou <mzhou@slack-corp.com>
case.

Log an warning when we call Add() on a counter with a negative number.

Fix the double underscore in the prometheus exported metrics when we
dedupe the namespace.

Signed-off-by: Maggie Zhou <mzhou@slack-corp.com>
Add a conversion function from - to _ and add unit test.
Signed-off-by: Maggie Zhou <mzhou@slack-corp.com>
counters.

Remove some unneccessary commented out code.

Signed-off-by: Maggie Zhou <mzhou@slack-corp.com>
Signed-off-by: Maggie Zhou <mzhou@slack-corp.com>
Signed-off-by: Maggie Zhou <mzhou@slack-corp.com>
Don't export counters. Refactor prometheusbackend.go code to collect
CountersWithLabels and GaugesWithLabels explicitly instead of relying on
underlying unexported-now Counters.

Signed-off-by: Maggie Zhou <mzhou@slack-corp.com>
@michael-berlin
Copy link
Contributor

@michael-berlin michael-berlin commented Apr 20, 2018

I just did it myself and pushed it to your branch.

Loading

An existing test used this functionality. When we stopped exporting the
"counters" type, the test lost that functionality. Instead of relying on
an implementation detail, there is a proper public method now.

Signed-off-by: Michael Berlin <mberlin@google.com>
@michael-berlin
Copy link
Contributor

@michael-berlin michael-berlin commented Apr 21, 2018

Internal import is done.

I'll merge this as soon as Travis has passed.

Loading

@michael-berlin
Copy link
Contributor

@michael-berlin michael-berlin commented Apr 21, 2018

The race unit test did flake and therefore test shard 0 failed. I'm ignoring that and will merge this now.

Loading

@michael-berlin michael-berlin merged commit 2abc74a into vitessio:master Apr 21, 2018
3 of 4 checks passed
Loading
michael-berlin added a commit to michael-berlin/vitess that referenced this issue Apr 28, 2018
vitessio#3784 introduced two layers of code for "stats.Counter" (and the new type "stats.Gauge").

As a consequence, it was for example possible to create a "stats.Gauge" for a time.Duration value.

This approach did not work well with our internal usage of the stats package.

Therefore, I reversed this and simplified the code:

- MetricFunc() interface was removed. Instead, a CountFunc or a GaugeFunc requires a simple func() int64 as input.
- IntFunc() removed. Before vitessio#3784, it was used to implement the expvar.Var interface. But now, this is taken care of by the types itself e.g "stats.CounterFunc". Therefore, we do not need it anymore and users of the "stats" package can just pass a plain func() int64.
- Added types "Duration" and "DurationFunc" back.
- Added "Variable" interface. This allowed to simplify the Prometheus code which depends on the Help() method for each stats variable.
- Prometheus: Conversion to float64 values is now done in prometheusbackend.go and removed from the stats package.

BUG=78571948

Signed-off-by: Michael Berlin <mberlin@google.com>
mcuadros pushed a commit to src-d/go-vitess that referenced this issue Sep 25, 2018