Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Add support for exporting metrics to Prometheus. #3784

Merged
merged 14 commits into from Apr 21, 2018

Conversation

Projects
None yet
4 participants
@zmagg
Copy link
Collaborator

commented Mar 28, 2018

Description

WIP PR to export metrics to Prometheus, as discussed in #3644

I've gotten feedback from @demmer on the approach in this PR, but I'd like to get some feedback from others as well, especially around metric names and the API exposed. Despite the long list of changes below, there are no breaking changes in this PR for anybody that relies on the expvar stats.

Metrics backend changes / naming stuff

  • Adds pull_backend.go which supports registering generic pull backends which can scrape for metrics.
  • Adds a Prometheus backend that includes custom collectors per metric type and creates constant, throwaway, metrics "on the fly" at scrape time. The Prometheus backend is set up to be configured as a plugin per Vitess component.
  • The Prometheus backend also programmatically converts the camel case metric names to snake case metric names at scrape time, to align with the Prometheus naming conventions. This includes special casing names that don't convert easily (like VSchema) (See: toSnake()) .
    Right now, it also adds the name of the component as a namespace to the metric, but that produces some metric names that have duplicate prefixes, as we already prefix a lot of the Vitess metric names with the component type (e.g. VtgateVschemaCounts turns into vtgate_vtgate_vschema_count)

Instrumentation changes

  • Adds Gauge-style types (Gauge, GaugeWithLabels, etc.) Prometheus expects Gauges and Counters to be separate types and we need this information in order to specify what kind of metric we're exporting.

  • Renamed all metric types. Mapping of rename:
    Int => Counter/Gauge (depending)
    IntFunc => GaugeFunc
    Counters => CountersWithLabels / GaugesWithLabels(depending)
    MultiCounters => CountersWithMultiLabels /GaugesWithMultiLabels (depending)
    MultiCounterFunc => CountersFuncWithMultiLabels

  • Adds a description string per metric.

  • Adds an explicit labelName for the tags in MultiCounters. Prometheus's labels are all key/value pairs.

  • Changes calls to stats.Publish(<IntFunc>/<DurationFunc>) to call stats.NewIntFunc/stats.NewDurationFunc explicitly instead.

TODO still

  • Add tests for each of the types of metrics
  • Add support for DurationFunc
  • Move registration of promhttp into servenv
  • After #3816 is merged, modify it to support toSnakeCase and a few replacers instead.
  • Get TravisCI working with the new packages I've added to the vendor file (promhttp)
  • Add a prombackend plugin for each component with metrics to be exported
  • It looks like many of the Counters/MultiCounters/MultiCounterFuncs are actually Gauges in practice. Since Prometheus differentiates between these two types, we need to go through and swap some of these out, where appropriate.
  • Dedupe the metric name with the namespace. e.g. vtgate_vtgate_vschema_counts should just be vtgate_vschema_counts

Metric types not exported

  • NewString / NewStringMap / PublishJSONFunc
    Prom. doesn't handle string values. Some of these could be converted into different types, if we'd like to export the metrics anyway.

  • NewFloat
    This isn't called anywhere today. We could maybe export them anyway in case there are any callers in the future.

  • direct calls to stats.Publish(...CounterFunc())
    I think this should get ported in a followup PR, possibly refactoring it to use MultiCounterFunc.

Additional things to do in follow-on PRs

  • Visit Rates / DurationFunc callsites and instrument them so that we can derive the rate/duration in Prometheus at querytime.

This change is Reviewable

@zmagg zmagg changed the title WIP PR to add support for exporting metrics to Prometheus. Add support for exporting metrics to Prometheus. Mar 28, 2018

@zmagg zmagg referenced this pull request Mar 28, 2018

Closed

Implementation of all Vitess counters => Prom #79

7 of 9 tasks complete
publish(name, t)
}

publishPullMultiGauges(t, name)

This comment has been minimized.

Copy link
@sougou

sougou Apr 2, 2018

Member

I think you should be able to do this by registering a new var hook using stats.Register, and have that function make these calls. This way, there will be no need to spray these additional publish calls through all the types.

@@ -229,6 +231,53 @@ func (v *Int) String() string {
return strconv.FormatInt(v.i.Get(), 10)
}

// Help returns the help string
func (v *Int) Help() string {

This comment has been minimized.

Copy link
@sougou

sougou Apr 2, 2018

Member

For all these simple types Int is same as IntGauge. We might as well have only one of them. Maybe we can just rename Int to IntGauge.


// String is the implementation of expvar.var
func (f IntFunc) String() string {
return strconv.FormatInt(f(), 10)
func (intFunc IntFunc) String() string {

This comment has been minimized.

Copy link
@sougou

sougou Apr 2, 2018

Member

Go style recommends using single letter names for simple types like this. I don't necessarily agree with it, but it's better to follow the guidelines :).

func Init(namespace string) {
http.Handle("/metrics", promhttp.Handler())
promBackend := &PromBackend{namespace: namespace}
stats.RegisterPullBackendImpl("prom", promBackend)

This comment has been minimized.

Copy link
@sougou

sougou Apr 2, 2018

Member

This should just call stats.Register with that new function that can call the various pull functions.

return output
}

func toSnake(name string) string {

This comment has been minimized.

Copy link
@sougou

sougou Apr 2, 2018

Member

@michael-berlin is there a way you can export youtube's function that converts to snake-case? I'm assuming that it shouldn't have anything proprietary.

This comment has been minimized.

Copy link
@zmagg

zmagg Apr 9, 2018

Author Collaborator

Moving the snake case / kebab-case conversation here:

(From @michael-berlin in Reviewable:) Turns out snake_case uses underscores but internally we're doing kebab-case (with hyphens).
I've exported the code: #3816
But I'm not sure if it's worth it merging it.
Also note that we do not have special handling for certain words e.g. the resulting internal variable name is vtgate-v-schema-counts :)

Ooh. Thanks for exporting!

@sougou Thoughts here? I can also just change my toSnakeCase to follow what toKebabCase is doing with the regexp package (which looks significantly cleaner than my hack with runes), and add a test.

I would also prefer to do some special casing (For things like VSchema, VtGate) as part of the Prometheus export.

for i := range cutoffs {
key := float64(cutoffs[i]) / 1000000000
//TODO(zmagg): int64 => uint64 conversion. error if it overflows?
output[key] = uint64(buckets[i]) + last

This comment has been minimized.

Copy link
@demmer

demmer Apr 4, 2018

Member

It seems to me the buckets should just be uint64 from the start.

This comment has been minimized.

Copy link
@zmagg

zmagg Apr 5, 2018

Author Collaborator

Hmm. They're sync2.AtomicInt64s in histogram.go.

This comment has been minimized.

Copy link
@demmer

demmer Apr 5, 2018

Member

I think that's wrong too :)

Of course the "right" thing seems to me that we add a sync2.AtomicUint64 and change all the histogram metrics to use that so we don't have to do this dance here.

@sougou any opinion here?

This comment has been minimized.

Copy link
@sougou

sougou Apr 5, 2018

Member

Agree that float64 should not be used as key because it's not an exact type. However, I recommend using sync2.AtomicInt64.

Basically, unsigned ints should only be used to represent non-numbers, like bitmaps. This is actually a coding standard within google, and I agree with it based on my own experience.
Using unsigned has the same problems of a const char *. It eventually poisons the entire code base.

Having said that, we have a few embarrassing places where uin64 is used as a number. It's mostly contained for now. Someday, we'll fix it :).

This comment has been minimized.

Copy link
@demmer

demmer Apr 6, 2018

Member

Interesting. Well ok then, I guess we stick with signed ints and this casting since the prom API wants unsigned ints for bucket levels (which does still seem more logical to me).

@zmagg zmagg force-pushed the tinyspeck:prom-metrics branch 2 times, most recently from 2648606 to 53195d7 Apr 5, 2018

@zmagg

This comment has been minimized.

Copy link
Collaborator Author

commented Apr 5, 2018

Updated description to account for new implementation changes and new TODOs. : )

@@ -0,0 +1,27 @@
package stats

// PullBackend should be implemented to export pull metrics from Vitess.

This comment has been minimized.

Copy link
@demmer

demmer Apr 5, 2018

Member

With the new change to just use the stats.Publish hook I actually don't think we need this interface any more since it looks like we never use it anywhere.

That also would let all the prom registration functions be private to the package instead of public.


// metricsCollector collects both stats.Counters and stats.Gauges
type metricsCollector struct {
counters map[*stats.Counters]*prom.Desc

This comment has been minimized.

Copy link
@demmer

demmer Apr 5, 2018

Member

From what I can tell reading below, it looks like we never actually put more than one stats.Counters into any given metricsCollector, so can this be simplified to just have a pointer to the counter and a pointer to the Desc?

@zmagg zmagg force-pushed the tinyspeck:prom-metrics branch from b7d2c26 to 667f2ce Apr 5, 2018

@demmer
Copy link
Member

left a comment

This is looking great!
A large number of comments above are mostly about the Counter / Gauge distinction but overall this is getting close.

@sougou Can you take a look as well?

connAccept = stats.NewInt("MysqlServerConnAccepted")
connSlow = stats.NewInt("MysqlServerConnSlow")
timings = stats.NewTimings("MysqlServerTimings", "MySQL server timings")
connCount = stats.NewCounter("MysqlServerConnCount", "Connection count for MySQL servers")

This comment has been minimized.

Copy link
@demmer

demmer Apr 6, 2018

Member

connCount is a gauge since it is the "Active mysql server connections", while connAccept is the "Count of accepted mysql server connections.

IMO it would be clearer if we renamed the vars themselves to connsActive and connsAccepted to make this clearer.

This comment has been minimized.

Copy link
@zmagg

zmagg Apr 11, 2018

Author Collaborator

(Changed the help string but I didn't rename the vars.)

timings = stats.NewTimings("MysqlServerTimings", "MySQL server timings")
connCount = stats.NewCounter("MysqlServerConnCount", "Connection count for MySQL servers")
connAccept = stats.NewCounter("MysqlServerConnAccepted", "Connections accepted by MySQL server")
connSlow = stats.NewCounter("MysqlServerConnSlow", "Slow MySQL server connections")

This comment has been minimized.

Copy link
@demmer

demmer Apr 6, 2018

Member

A better description would be "Count of connections that took more than the configured mysql_slow_connect_warn_threshold to establish.

ConnCount: stats.NewInt(countTag),
ConnAccept: stats.NewInt(acceptTag),
ConnCount: stats.NewCounter(countTag, "Connection count inside net.Listener"),
ConnAccept: stats.NewCounter(acceptTag, "Connections accepted inside net.Listener"),

This comment has been minimized.

Copy link
@demmer

demmer Apr 6, 2018

Member

ConnCount is also a Gauge. I'd recommend the same help string as with the mysql protocol case.

return v
}

// Add adds the provided value to the Int

This comment has been minimized.

Copy link
@demmer

demmer Apr 6, 2018

Member

Nit -- replace Int with Counter

// Reset resets all counter values
func (c *Counters) Reset() {
c.mu.Lock()
defer c.mu.Unlock()
c.counts = make(map[string]*int64)
}

// ResetCounter resets a specific counter value to 0
func (c *Counters) ResetCounter(name string) {

This comment has been minimized.

Copy link
@demmer

demmer Apr 6, 2018

Member

For consistency, I would change this function to be Reset(name string) and the above one ResetAll().

UserTableQueryCount = stats.NewMultiCounters("UserTableQueryCount", []string{"TableName", "CallerID", "Type"})
UserTableQueryCount = stats.NewCountersWithMultiLabels(
"UserTableQueryCount",
"Number of queries received for each CallerID/table comb",

This comment has been minimized.

Copy link
@demmer

demmer Apr 6, 2018

Member

"comb"=> "combination"

UserTableQueryTimesNs = stats.NewMultiCounters("UserTableQueryTimesNs", []string{"TableName", "CallerID", "Type"})
UserTableQueryTimesNs = stats.NewCountersWithMultiLabels(
"UserTableQueryTimesNs",
"Shows total latency for each CallerID/table combo",

This comment has been minimized.

Copy link
@demmer

demmer Apr 6, 2018

Member

Again I think drop "Shows" and change "combo" to "combination"

@@ -238,7 +238,7 @@ func NewTabletServer(config tabletenv.TabletConfig, topoServer *topo.Server, ali
// So that vtcombo doesn't even call it once, on the first tablet.
// And we can remove the tsOnce variable.
tsOnce.Do(func() {
stats.Publish("TabletState", stats.IntFunc(func() int64 {
stats.NewGaugeFunc("TabletState", "Tablet server's state", (func() int64 {

This comment has been minimized.

Copy link
@demmer

demmer Apr 6, 2018

Member

"Tablet server state"

queueExceeded = stats.NewCounters("TxSerializerQueueExceeded")
queueExceeded = stats.NewCountersWithLabels(
"TxSerializerQueueExceeded",
"Number of transactions that were rejcted because the max queue size per row range was exceeded",

This comment has been minimized.

Copy link
@demmer

demmer Apr 6, 2018

Member

"Rejected"

// statsStateDurations tracks for each state how much time was spent in it. Mainly used for testing.
statsStateDurationsNs = stats.NewCounters("WorkerStateDurations")
statsStateDurationsNs = stats.NewGaugesWithLabels("WorkerStateDurations", "How much time was spent in each state", "state")

This comment has been minimized.

Copy link
@demmer

demmer Apr 6, 2018

Member

Why not a Timer?

This comment has been minimized.

Copy link
@zmagg

zmagg Apr 11, 2018

Author Collaborator

Hmm. It's a good idea. Followup PR? I'd like to avoid breaking more in this one. : )

@michael-berlin

This comment has been minimized.

Copy link
Contributor

commented Apr 9, 2018

Reviewed 5 of 43 files at r2, 3 of 19 files at r3, 3 of 12 files at r4.
Review status: 11 of 62 files reviewed at latest revision, 37 unresolved discussions.


go/cmd/vtgate/plugin_prombackend.go, line 27 at r4 (raw file):

func init() {
	prombackend.Init("vtgate")

Is it necessary to have a different namespace for each binary?

If so, what about other binaries which we have e.g. vtctld and vtworker?

Does Prometheus require that? Its internal Google counterpart doesn't care and instead can distinguish binaries by their production job name ;)


go/stats/export.go, line 266 at r4 (raw file):

// IntFunc converts a function that returns
// an int64 as an expvar.
type IntFunc func() int64

As @demmer pointed out before, you'll probably need a CounterFunc which will be the renamed version of this.

Note that the removal/rename of this will break the Google internal plugins. Given that, please give me or @alainjobart a head's up before merging this. This way, we can change all internal callers first and verify that they won't break due to this change.


go/stats/prombackend/prombackend.go, line 144 at r1 (raw file):

Previously, sougou (Sugu Sougoumarane) wrote…

@michael-berlin is there a way you can export youtube's function that converts to snake-case? I'm assuming that it shouldn't have anything proprietary.

Turns out snake_case uses underscores but internally we're doing kebab-case (with hyphens).

I've exported the code: #3816

But I'm not sure if it's worth it merging it.

Also note that we do not have special handling for certain words e.g. the resulting internal variable name is vtgate-v-schema-counts :)


go/stats/prombackend/prombackend.go, line 11 at r4 (raw file):
https://github.com/golang/go/wiki/CodeReviewComments#imports

Avoid renaming imports except to avoid a name collision; good package names should not require renaming.

Given this style guide recommendation, you shouldn't rename the import to "prom".

(In general, I personally don't like abbreviations (unless it's a local variable or a Go receiver name). E.g. "prombackend" is less descriptive than "prometheus_backend". Similar comment for the other names e.g. "PublishPromMetric".)


go/vt/vtgate/masterbuffer/masterbuffer.go, line 50 at r4 (raw file):

	fakeBufferDelay        = flag.Duration("fake_buffer_delay", 1*time.Second, "The amount of time that we should delay all master requests for, to fake a buffer.")

	bufferedRequestsAttempted  = stats.NewCounter("BufferedRequestsAttempted", "Count of buffered requests attempted")

Head's up: This is dead code which I'm going to delete here: #3814


go/vt/worker/worker.go, line 63 at r4 (raw file):

	// statsRetryCount is the total number of times a query to vttablet had to be retried.
	statsRetryCount = stats.NewCounter("WorkerRetryCount", "Total number of times a query to a vttablet had to be retried")
	// statsRetryCount groups the number of retries by category e.g. "TimeoutError" or "Readonly".

Can you please delete the comments since you copied them into the description now?

Please make sure that no information gets lost e.g. the description for this stats var misses the examples e.g. "TimeoutError" or "Readonly".


Comments from Reviewable

@zmagg

This comment has been minimized.

Copy link
Collaborator Author

commented Apr 9, 2018

@michael-berlin

Replying here in a few comments, as I can't respond in Reviewable for now. (I need to check with some internal people about the permissions that responding in Reviewable requires (it asked for write perms to all my orgs...?))

Small stuff first:

Is it necessary to have a different namespace for each binary?
If so, what about other binaries which we have e.g. vtctld and vtworker?
Does Prometheus require that? Its internal Google counterpart doesn't care and instead can distinguish binaries by their production job name ;)

@demmer and I talked about this a while back, but I forget why we landed on this. Prometheus does let you specify a namespace per binary in the static scrape config, which we could use instead of putting the namespace here. @demmer, any new thoughts?

Avoid renaming imports except to avoid a name collision; good package names should not require renaming.
Given this style guide recommendation, you shouldn't rename the import to "prom".
(In general, I personally don't like abbreviations (unless it's a local variable or a Go receiver name). E.g. "prombackend" is less descriptive than "prometheus_backend". Similar comment for the other names e.g. "PublishPromMetric".)

That sounds good actually, I'll do a rename pass.

Head's up: This is dead code which I'm going to delete here: #3814

Thanks for the head's up : )

Can you please delete the comments since you copied them into the description now?
Please make sure that no information gets lost e.g. the description for this stats var misses the examples e.g. "TimeoutError" or "Readonly".

Will do!

@@ -261,15 +228,6 @@ func (v *Duration) String() string {
return strconv.FormatInt(int64(v.i.Get()), 10)
}

// IntFunc converts a function that returns

This comment has been minimized.

Copy link
@zmagg

zmagg Apr 9, 2018

Author Collaborator

(From @michael-berlin in Reviewable): As @demmer pointed out before, you'll probably need a CounterFunc which will be the renamed version of this.
Note that the removal/rename of this will break the Google internal plugins. Given that, please give me or @alainjobart a head's up before merging this. This way, we can change all internal callers first and verify that they won't break due to this change.

Yes! It's changed to a GaugeFunc in a9d7fef
I think we'll also need a CounterFunc, for the ones that are Counters and not Gauges.

re: breaking the Google internal plugins

I'd love to understand how the Google internal plugins use this so that I can avoid breaking it. Are you using the expvar hook or some other way of exporting and converting the expvars to the internal system?

Is it the removal here that is likely to break (So, here, we went from IntFunc as a parameter to stats.Publish to explicitly creating a new GaugeFunc), or is it the any/all metric type renames? If it's the latter, head's up that I actually renamed almost all of the metric names. Some of these were to more clearly indicate the Gauge/Counter difference. (Full map in PR description).

Will that require a lot of work on y'all's part to update the plugins internally?

This comment has been minimized.

Copy link
@michael-berlin

michael-berlin Apr 9, 2018

Contributor

Internal code basically just adds additional stats variables.

Examples:

stats.Publish("XXX", stats.IntFunc(<some method>))

stats.Publish("XXX", stats.FloatFunc(func() float64 { return float64(<some method>) }))

timings:   stats.NewMultiTimings("VtgateXxxApi", []string{"Operation", "Keyspace", "DbType"}),

errorCounters = stats.NewCounters("XXXErrors", "A", "B")

requests = stats.NewMultiCounters("XXX", []string{"A", "B", "C"})

As long as it's easy to rewrite this code, it won't be a lot of work and fine.

)

func init() {
prombackend.Init("vtgate")

This comment has been minimized.

Copy link
@michael-berlin

michael-berlin Apr 9, 2018

Contributor

Starting a thread here instead of using Reviewable:

I wrote:

Is it necessary to have a different namespace for each binary?
If so, what about other binaries which we have e.g. vtctld and vtworker?
Does Prometheus require that? Its internal Google counterpart doesn't care and instead can distinguish binaries by their production job name ;)

zmagg@ responded:

@demmer and I talked about this a while back, but I forget why we landed on this. Prometheus does let you specify a namespace per binary in the static scrape config, which we could use instead of putting the namespace here. @demmer, any new thoughts?

The plugin here actually does two things:

  1. Install the "/metrics" handler.
  2. Set the namespace.

Would it make sense to move 1. to our servenv instead and install the handler by default? What are the downsides of having the handler in there by default?

re: 2. If setting the namespace is a crucial aspect, I would prefer to do it in code. This way it's more reusable. Otherwise, other people will ask you for this static map config ;) If you set the namespace in code, my comment is that you should set the namespace for all binaries, not just vtgate and vttablet.

This comment has been minimized.

Copy link
@zmagg

zmagg Apr 11, 2018

Author Collaborator

Oh interesting.

  1. I think moving it into servenv makes sense, location-wise. The thought here was that we wanted to be able to not have any Prometheus-stuff running if a build didn't want it, but I think we can handle that inside servenv with a flag.
  2. Yeah, setting the namespace is crucial. I've added one for the vtworker and vtctl binaries now, are there any that I missed that metrics are in too? Those are all the binaries that we use at Slack.
@zmagg

This comment has been minimized.

Copy link
Collaborator Author

commented Apr 11, 2018

Updated the PR description for what's left.

Commit I just pushed included responses to all the review comments except:

  • Support for DurationFunc
  • Moving the registration of the promhttp handler into servenv

Also, I still need to add unit tests, and we need to resolve what to do about this toSnakeCase thing.

@zmagg

This comment has been minimized.

Copy link
Collaborator Author

commented Apr 13, 2018

@demmer Can you take another look? The PR description is up to date with my latest changes. : )

One thing I'm considering adding is preventing Counter-type metrics from calling Add() with negative numbers. Thoughts?

@demmer
Copy link
Member

left a comment

This is looking really really close!

Most of these comments relate to description nit picks.

The only substantive issue relates to the need for Counter/Gauge variants of DurationFunc.

Also as you I'm sure noticed, there are merge conflicts and test / CodeClimate cleanup.


func init() {
servenv.OnRun(func() {
http.Handle("/metrics", promhttp.Handler())

This comment has been minimized.

Copy link
@demmer

demmer Apr 13, 2018

Member

Rather than make each of the plugins pull in the promhttp details and register the handler, why not move this inside prometheusbackend.Init.

connSlow = stats.NewInt("MysqlServerConnSlow")
timings = stats.NewTimings("MysqlServerTimings", "MySQL server timings")
connCount = stats.NewGauge("MysqlServerConnCount", "Active MySQL server connections")
connAccept = stats.NewCounter("MysqlServerConnAccepted", "Count of connections accepted by MySQL server")

This comment has been minimized.

Copy link
@demmer

demmer Apr 13, 2018

Member

Admittedly this is taste, but I find the use of "Count of" to be unnecessary in this and other descriptions. Ideally we should be consistent one way or the other and either always use it or never use it.

Personally I find "Accepted MySQL server connections" to be sufficiently descriptive and more consistent with the gauge case (where we don't say "Gauge of active MySQL server connections").

@@ -37,8 +38,8 @@ type countingConnection struct {
func Published(l net.Listener, countTag, acceptTag string) net.Listener {
return &CountingListener{
Listener: l,
ConnCount: stats.NewInt(countTag),
ConnAccept: stats.NewInt(acceptTag),
ConnCount: stats.NewGauge(countTag, "Active connections accepted by counting listener"),

This comment has been minimized.

Copy link
@demmer

demmer Apr 13, 2018

Member

It has always been confusing to me that ConnCount and ConnAccepted don't have more descriptive names, and these generic descriptions don't help either :)

While I'm happy to defer this to a subsequent improvement, IMO we should change the proc.Listen API to include a description string and pass that here as well along with the countTag.

Then these would be:

ConnCount: stats.NewGauge(countTag, fmt.Sprintf("Active %s connections", description)),
ConnAccept: stats.NewCounter(countTag, fmt.Sprintf("Accepted %s connections", description)),

Since these are only used for HTTP, we would then change servenv.Run to include "HTTP" as the description string.

I also think we would be well served to rename the generic metric names as well to be HTTPConnCount and HTTPConnAccepted but that would break backwards compatibility. @sougou / @michael-berlin thoughts on this?

This comment has been minimized.

Copy link
@zmagg

zmagg Apr 18, 2018

Author Collaborator

Agreed, but let's do it in a follow-on?

This comment has been minimized.

Copy link
@demmer

demmer Apr 19, 2018

Member

Yup. Would still like to hear from @sougou and @michael-berlin about the backwards-incompatibility of this proposed change.

}

// Add adds the provided value to the Counter
func (v *Counter) Add(delta int64) {

This comment has been minimized.

Copy link
@demmer

demmer Apr 13, 2018

Member

FWIW the prom API includes an Inc() method which seems nicer than Add(1), especially if we follow through with the idea to prevent Add of a negative number.

This comment has been minimized.

Copy link
@zmagg

zmagg Apr 18, 2018

Author Collaborator

Added logging for now!

ch <- prometheus.MustNewConstMetric(c.desc, prometheus.CounterValue, float64(c.df.F()/1000000000.0))
}

func toPromValueType(vt ValueType) prometheus.ValueType {

This comment has been minimized.

Copy link
@demmer

demmer Apr 13, 2018

Member

Why do we need this conversion and the ValueType enum at all? It seems like holdover from the more generic pullbackend design, and with this current approach we can avoid the conversion by just passing the appropriate prometheus metric type along when we instantiate the collector.

stats.NewGaugeFunc(name+"Active", "Tablet server conn pool active", cp.Active)
stats.NewGaugeFunc(name+"InUse", "Tablet server conn pool in use", cp.InUse)
stats.NewGaugeFunc(name+"MaxCap", "Tablet server conn pool max cap", cp.MaxCap)
stats.NewGaugeFunc(name+"WaitCount", "Tablet server conn pool wait count", cp.WaitCount)

This comment has been minimized.

Copy link
@demmer

demmer Apr 13, 2018

Member

Same comment applies :)

stats.NewGaugeFunc("WarnResultSize", "Query engine warn result size", qe.warnResultSize.Get)
stats.NewGaugeFunc("MaxDMLRows", "Query engine max DML rows", qe.maxDMLRows.Get)
stats.NewGaugeFunc("StreamBufferSize", "Query engine stream buffer size", qe.streamBufferSize.Get)
stats.NewGaugeFunc("TableACLExemptCount", "Query engine table ACL exempt count", qe.tableaclExemptCount.Get)

This comment has been minimized.

Copy link
@demmer

demmer Apr 13, 2018

Member

TableACLExemptCount is a Counter.

stats.NewGaugeFunc("QueryCacheLength", "Query engine query cache length", qe.plans.Length)
stats.NewGaugeFunc("QueryCacheSize", "Query engine query cache size", qe.plans.Size)
stats.NewGaugeFunc("QueryCacheCapacity", "Query engine query cache capacity", qe.plans.Capacity)
stats.NewGaugeFunc("QueryCacheEvictions", "Query engine query cache evictions", qe.plans.Evictions)

This comment has been minimized.

Copy link
@demmer

demmer Apr 13, 2018

Member

QueryCacheEvictions is a counter

// QPSRates shows the qps of QueryStats. Sample every 5 seconds and keep samples for up to 15 mins.
QPSRates = stats.NewRates("QPS", QueryStats, 15*60/5, 5*time.Second)
// WaitStats shows the time histogram for wait operations
WaitStats = stats.NewTimings("Waits")
WaitStats = stats.NewTimings("Waits", "Shows the time histogram for wait operations")

This comment has been minimized.

Copy link
@demmer

demmer Apr 13, 2018

Member

More description nits: "Shows the..." should be just "Wait operations" I think

"Errors",
"Number of critical errors that happened",

This comment has been minimized.

Copy link
@demmer

demmer Apr 13, 2018

Member

I think:

"Critical errors"
"Internal component errors"
...

@zmagg zmagg force-pushed the tinyspeck:prom-metrics branch from cf77ea9 to d6b3699 Apr 14, 2018

@zmagg

This comment has been minimized.

Copy link
Collaborator Author

commented Apr 14, 2018

Ok!

Responded to your comments.

Major changes:

  • Changed up CounterFunc & GaugeFunc to take in a new type: MetricFunc that allows us to both export ints (like it did before) and now also durations.

I did a best effort pass through of which of the DurationFuncs were gauges/counters, but would love an extra eye on that @demmer.

Also rebased, made all the other changes I 👍'd to your comments on.

What's left:

  • Now that the toKebabCase 🍢 is merged, I'll take a look next week when I'm back to make the toSnake modifications we talked about in person.
  • As discussed in person, add logging / increment a counter if we ever call Add() with a negative number on a Counter. In future PRs, we can maybe ban such things once we're confident we've removed all existing cases from the codebase.

@zmagg zmagg force-pushed the tinyspeck:prom-metrics branch 2 times, most recently from 3ca8b93 to e298278 Apr 14, 2018

// For systems (like Prometheus) that ask for snake-case names.
// It converts CamelCase to camel_case, and CAMEL_CASE to camel_case.
// For numbers, it converts 0.5 to v0_5.
func toSnakeCase(name string) (hyphenated string) {

This comment has been minimized.

Copy link
@zmagg

zmagg Apr 18, 2018

Author Collaborator

I added the toSnakeCase modifications based on the toKebabCase stuff. I'm not totally convinced it structurally makes sense to put it here (at minimum, I should rename the file if it otherwise makes sense to you).

Also, it shares a memoizer with the toKebabCase, which should be fine (one build should only use one or the other), but perhaps they should be separate in the code too?

@demmer, thoughts?

This comment has been minimized.

Copy link
@demmer

demmer Apr 19, 2018

Member

For safety and cleanliness I actually don't think they should share a memoizer.

Given that... I don't think there's actually any substantive shared code any more, so rather than misleadingly sharing the file name, I'd say you should just create a snake_case_converter.go with the new code and a separate memoizer.

@zmagg

This comment has been minimized.

Copy link
Collaborator Author

commented Apr 18, 2018

@demmer
ok, finished up the rest of the items, although I have a few questions about where to put the toSnakeCase stuff structurally (commented in the code here: e298278#r182589219 )

Thoughts now?

@demmer

This comment has been minimized.

Copy link
Member

commented Apr 19, 2018

From my standpoint the only remaining item before I think this is good to merge is the minor reorganization of where the snake conversion goes.

@michael-berlin / @alainjobart since this will entail some breaking changes with internal Google importing I would like a 👍 from one of you that this is ok to merge.

@sougou told me in DM he doesn't care, so he doesn't get a vote on this one.

@sougou

This comment has been minimized.

Copy link
Member

commented Apr 19, 2018

I do care, but I know it's in good hands with @michael-berlin and @alainjobart :)

@zmagg zmagg force-pushed the tinyspeck:prom-metrics branch from 039db06 to 00d7afb Apr 19, 2018

// Add adds the provided value to the Counter
func (v *Counter) Add(delta int64) {
if delta < 0 {
log.Warningf("Adding a negative value to a counter, %v should be a gauge instead", v)

This comment has been minimized.

Copy link
@michael-berlin

michael-berlin Apr 19, 2018

Contributor

How likely is it that this will get called?

If it's not likely at all because you already checked all callers in open-source, then I'm okay with it as is.

Otherwise, it may be a good idea to use a throttled logger here instead. This way we would avoid that bad code spams the log, makes the disk full and potentially slows down the process.

This comment has been minimized.

Copy link
@zmagg

zmagg Apr 19, 2018

Author Collaborator

Oh, that's a good idea. I've checked all the easy sites (Add(-1)) but there's a lot more that wind their way through the code that I don't have confidence on, hence logging for now instead of banning decrementing all together. I switched to using a throttled logger here. I also switched to using a throttled logger for the unsupported export type logger in prometheusbackend.go, thanks for the suggestion!

@michael-berlin

This comment has been minimized.

Copy link
Contributor

commented Apr 19, 2018

It looks like this is good to merge? If so, let's do as follows:

  1. I'll try to import it locally and make all necessary changes.
  2. If thinks are working, I'll submit the internal import.
  3. Shortly after that, I'll merge this PR.

Let me try to get to this today or tomorrow.

@zmagg

This comment has been minimized.

Copy link
Collaborator Author

commented Apr 19, 2018

@michael-berlin That sounds awesome!

Let me figure out what's going on with these failing TravisCI tests after my last commit. The tests still succeed for me locally, so not sure. Debugging now.

// Counters is similar to expvar.Map, except that
// it doesn't allow floats. In addition, it provides
// a Counts method which can be used for tracking rates.
// it doesn't allow floats. It is used to build CountersWithLabels and GaugesWithLabels.
type Counters struct {

This comment has been minimized.

Copy link
@michael-berlin

michael-berlin Apr 20, 2018

Contributor

Is it actually necessary to export this?

I saw that you're using it in the promstatsbackend but the usage was not immediately clear to me.

I'm asking because we have several type switches in the code base and I made the mistake of using "stats.Counter" instead of "stats.CountersWithLabels" in these type switches.

This comment has been minimized.

Copy link
@zmagg

zmagg Apr 20, 2018

Author Collaborator

Yeah I agree, it doesn't make sense. Let me subsume what it's doing into CountersWithLabels & GaugesWithLabels.

This comment has been minimized.

Copy link
@zmagg

zmagg Apr 20, 2018

Author Collaborator

Done!


// CountersWithLabels provides a labelName for the tagged values in Counters
// It provides a Counts method which can be used for tracking rates.
type CountersWithLabels struct {

This comment has been minimized.

Copy link
@michael-berlin

michael-berlin Apr 20, 2018

Contributor

Technically, this is only a single label and not multiple labels.

i.e. it should be named CountersWithLabel ?

This comment has been minimized.

Copy link
@zmagg

zmagg Apr 20, 2018

Author Collaborator

Hmm. I don't have strong feelings about this but the plural-Labels was chosen because that's what the Prometheus golang client uses for naming, and it is multiple label values, if not multiple label keys.

This comment has been minimized.

Copy link
@michael-berlin

michael-berlin Apr 20, 2018

Contributor

Okay. Let's keep it as is for now.

}

func TestGaugeFunc(t *testing.T) {
var gotname string

This comment has been minimized.

Copy link
@michael-berlin

michael-berlin Apr 20, 2018

Contributor

These two variables are unused.

Same for TestDurationFunc below.

Also: The Register() here seems to be unnecessary as well?

Can you please double-check this file and report back? Thanks!

This comment has been minimized.

Copy link
@zmagg

zmagg Apr 20, 2018

Author Collaborator

Oh thanks for the catch. I added tests against gotname and gotv for GaugeFunc and DurationFunc.

@michael-berlin

This comment has been minimized.

Copy link
Contributor

commented Apr 20, 2018

I have an internal import open with all the necessary changes to our internal code. Tests are passing now.

Can you please look at my last comments and address them in additional CLs? Once that's done, I'll patch them into my pending import as well and then we can wrap things up.

@zmagg zmagg force-pushed the tinyspeck:prom-metrics branch from ce9194c to 6de12e7 Apr 20, 2018

@zmagg

This comment has been minimized.

Copy link
Collaborator Author

commented Apr 20, 2018

@michael-berlin Yay, awesome! I responded to your comments and rebased to resolve the merge conflict.

@michael-berlin

This comment has been minimized.

Copy link
Contributor

commented Apr 20, 2018

I ran into the merge conflict during the import as well. Therefore, I just reverted that PR.

Given that, can you please rebase again against the latest master?

zmagg added some commits Feb 28, 2018

WIP PR to add support for exporting metrics to Prometheus.
Signed-off-by: Maggie Zhou <mzhou@slack-corp.com>
* Refactor the pull backend to use the expvar hook.
* Fix the nanoseconds => ints for histogram buckets.
* Gigantic rename of metrics:

-Int => Counter
-IntGauge => Gauge
-IntFunc => GaugeFunc
-Counters => CountersWithLabels
-MultiCounters => CountersWithMultiLabels
-MultiCounterFunc => CountersFuncWithMultiLabels
-Gauges => GaugesWithLabels
-MultiGauges=> GaugesWithMultiLabels

* Add an explicit labelName for CountersWithLabels & GaugesWithLabels

Signed-off-by: Maggie Zhou <mzhou@slack-corp.com>
Quick followup to @demmer's IRL comments that I could push some of the
pull_backend.go stuff into prombackend/prombackend.go's Register call
and make the pull_backend.go an interface only.

Also, fix some more types from the gigantic refactor, and run
GoLint/GoVet

Signed-off-by: Maggie Zhou <mzhou@slack-corp.com>
Get rid of the pull_backend.go
Fix tests to use the new metric names.
Fix tests that expected gauges and not counters. (more to come on this
thing, but these were the tests that indicated gauges early)
Fix the `Set()` implementation for GaugesWithMultiLabels.

Signed-off-by: Maggie Zhou <mzhou@slack-corp.com>
Dedupe metric names with namespace names.
Add plugins for each of the components.

Rename prombackend => prometheusbackend
Rename publishPromMetric to publishPrometheusMetric

Rename Reset() => ResetAll()
ResetCounter => Reset()

Rename various Counters => Gauges per Mike's helpful pointers.
Rename a few help strings.
Remove the comments in go/vt/worker/worker.go metrics and move them 100% into help functions.
Add a GaugeFuncWithMultiLabels
Add a CounterFunc

Signed-off-by: Maggie Zhou <mzhou@slack-corp.com>
Support DurationFunc
Convert Nanoseconds to seconds

Call Init (handler) inside servenv.OnRun()

Add unit tests.

For counter, it doesn't make sense to have a function called ResetAll
and not one called Reset so re-rename that.

Signed-off-by: Maggie Zhou <mzhou@slack-corp.com>
- Changed up `CounterFunc` & `GaugeFunc` to take in a new type: `Metr…
…icFunc` that allows us to both export ints (like it did before) and now also durations.

- Moved Prom specific stuff in the plugins to the prometheus backend
itself.
- A ton of stats help string fix ups.
- some counters are gauges, and some gauges are counters! Fix those.

Signed-off-by: Maggie Zhou <mzhou@slack-corp.com>
Write a modification to kebab case converter for converting to snake
case.

Log an warning when we call Add() on a counter with a negative number.

Fix the double underscore in the prometheus exported metrics when we
dedupe the namespace.

Signed-off-by: Maggie Zhou <mzhou@slack-corp.com>
Create new snake_case_converter.
Add a conversion function from - to _ and add unit test.
Signed-off-by: Maggie Zhou <mzhou@slack-corp.com>
Use the throttled logger for unsupported metric types and decrementing
counters.

Remove some unneccessary commented out code.

Signed-off-by: Maggie Zhou <mzhou@slack-corp.com>
Initialize the throttledLogger before calling stats.Register()
Signed-off-by: Maggie Zhou <mzhou@slack-corp.com>
Remove new line.
Signed-off-by: Maggie Zhou <mzhou@slack-corp.com>
Add gotname & gotv tests to export_test.go
Don't export counters. Refactor prometheusbackend.go code to collect
CountersWithLabels and GaugesWithLabels explicitly instead of relying on
underlying unexported-now Counters.

Signed-off-by: Maggie Zhou <mzhou@slack-corp.com>

@michael-berlin michael-berlin force-pushed the tinyspeck:prom-metrics branch from 6de12e7 to 8b2491a Apr 20, 2018

@michael-berlin

This comment has been minimized.

Copy link
Contributor

commented Apr 20, 2018

I just did it myself and pushed it to your branch.

stats: Add Counts() method to multi counter type.
An existing test used this functionality. When we stopped exporting the
"counters" type, the test lost that functionality. Instead of relying on
an implementation detail, there is a proper public method now.

Signed-off-by: Michael Berlin <mberlin@google.com>
@michael-berlin

This comment has been minimized.

Copy link
Contributor

commented Apr 21, 2018

Internal import is done.

I'll merge this as soon as Travis has passed.

@michael-berlin

This comment has been minimized.

Copy link
Contributor

commented Apr 21, 2018

The race unit test did flake and therefore test shard 0 failed. I'm ignoring that and will merge this now.

@michael-berlin michael-berlin merged commit 2abc74a into vitessio:master Apr 21, 2018

3 of 4 checks passed

continuous-integration/travis-ci/pr The Travis CI build failed
Details
DCO All commits have a DCO sign-off from the author
Details
codeclimate All good!
Details
deploy/netlify Deploy preview ready!
Details

michael-berlin added a commit to michael-berlin/vitess that referenced this pull request Apr 28, 2018

stats: Simplify code for single-value stat variables.
vitessio#3784 introduced two layers of code for "stats.Counter" (and the new type "stats.Gauge").

As a consequence, it was for example possible to create a "stats.Gauge" for a time.Duration value.

This approach did not work well with our internal usage of the stats package.

Therefore, I reversed this and simplified the code:

- MetricFunc() interface was removed. Instead, a CountFunc or a GaugeFunc requires a simple func() int64 as input.
- IntFunc() removed. Before vitessio#3784, it was used to implement the expvar.Var interface. But now, this is taken care of by the types itself e.g "stats.CounterFunc". Therefore, we do not need it anymore and users of the "stats" package can just pass a plain func() int64.
- Added types "Duration" and "DurationFunc" back.
- Added "Variable" interface. This allowed to simplify the Prometheus code which depends on the Help() method for each stats variable.
- Prometheus: Conversion to float64 values is now done in prometheusbackend.go and removed from the stats package.

BUG=78571948

Signed-off-by: Michael Berlin <mberlin@google.com>

mcuadros pushed a commit to src-d/go-vitess that referenced this pull request Sep 25, 2018

stats: Simplify code for single-value stat variables.
vitessio/vitess#3784 introduced two layers of code for "stats.Counter" (and the new type "stats.Gauge").

As a consequence, it was for example possible to create a "stats.Gauge" for a time.Duration value.

This approach did not work well with our internal usage of the stats package.

Therefore, I reversed this and simplified the code:

- MetricFunc() interface was removed. Instead, a CountFunc or a GaugeFunc requires a simple func() int64 as input.
- IntFunc() removed. Before vitessio/vitess#3784, it was used to implement the expvar.Var interface. But now, this is taken care of by the types itself e.g "stats.CounterFunc". Therefore, we do not need it anymore and users of the "stats" package can just pass a plain func() int64.
- Added types "Duration" and "DurationFunc" back.
- Added "Variable" interface. This allowed to simplify the Prometheus code which depends on the Help() method for each stats variable.
- Prometheus: Conversion to float64 values is now done in prometheusbackend.go and removed from the stats package.

BUG=78571948

Signed-off-by: Michael Berlin <mberlin@google.com>
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
You can’t perform that action at this time.