
Logging and metrics #4

Closed

rmarianski opened this issue Dec 29, 2016 · 6 comments

rmarianski (Member) commented Dec 29, 2016

Related to #3, we'll want to be able to track whether tapalcatl was able to satisfy the request from S3 or not. If tapalcatl will continue to act as a proxy for tileserver, it will need to expose this information; if not, then the Fastly config should be updated accordingly.

Additionally, we should think about logging anything that might be useful to capture for metrics/analytics/alerting purposes. If it's something that we want to capture in Redshift for longer-term storage, the easiest approach is to stick the relevant information in a response header and have the Fastly log parsing processes handle it.
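For illustration, a minimal sketch of what that could look like; the helper functions and the X-Tapalcatl-Hit header name are made up here, not tapalcatl's actual code:

```go
package main

import (
	"fmt"
	"net/http"
)

// fetchFromS3 stands in for whatever S3 lookup tapalcatl performs.
func fetchFromS3(path string) ([]byte, bool) {
	return nil, false // pretend it's always a miss in this sketch
}

// proxyToTileserver stands in for the fallback proxy behaviour.
func proxyToTileserver(w http.ResponseWriter, r *http.Request) {
	fmt.Fprintln(w, "proxied response")
}

func tileHandler(w http.ResponseWriter, r *http.Request) {
	if body, ok := fetchFromS3(r.URL.Path); ok {
		// Record the outcome in a header so Fastly's log parsing can capture it.
		w.Header().Set("X-Tapalcatl-Hit", "s3")
		w.Write(body)
		return
	}
	w.Header().Set("X-Tapalcatl-Hit", "tileserver-proxy")
	proxyToTileserver(w, r)
}

func main() {
	http.HandleFunc("/", tileHandler)
	http.ListenAndServe(":8080", nil)
}
```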

rmarianski (Member, Author)

I explored using expvar for this because it's already used throughout, but it seems like it's meant just for printing out debug stats from a JSON endpoint. Since the docs say the strings it prints are JSON, we could probably unmarshal them, but I'm on the fence as to whether that's better than doing something else in addition to expvar.
Anyway, to complete the initial thought: I was thinking of having a goroutine run on a set interval, e.g. every minute, and log a summary of the activity in that period. Then we'd have something pick up that log entry and have it end up in a metric.
But maybe it's better to just log the individual entries themselves and have any aggregation happen externally. That's probably the better option, because changing aggregation strategies could then happen outside the code itself.
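As a rough sketch of the interval-summary idea (illustrative only; the counter names and one-minute interval are assumptions, not tapalcatl's actual code):

```go
package main

import (
	"log"
	"sync/atomic"
	"time"
)

var (
	s3Hits   uint64
	s3Misses uint64
)

// The request handlers would call these as requests are served.
func recordHit()  { atomic.AddUint64(&s3Hits, 1) }
func recordMiss() { atomic.AddUint64(&s3Misses, 1) }

// logSummary logs and resets the counters once per interval.
func logSummary(interval time.Duration) {
	for range time.Tick(interval) {
		hits := atomic.SwapUint64(&s3Hits, 0)
		misses := atomic.SwapUint64(&s3Misses, 0)
		log.Printf("summary: s3_hits=%d s3_misses=%d interval=%s", hits, misses, interval)
	}
}

func main() {
	go logSummary(time.Minute)
	// ... serve requests here, calling recordHit/recordMiss as appropriate ...
	recordHit()
	recordMiss()
	select {}
}
```

Logging individual entries instead, as suggested above, just moves this aggregation out of the process and into whatever consumes the log.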

rmarianski (Member, Author)

I added a Grafana dashboard for dev as well and put the relevant tapalcatl panels in there. Of note:

  • I grouped the boolean counters into "counts" and "errors" (both in the graphs and in code) because I think it makes sense to view them that way. Grafana has a handy way of including all metrics that are grouped together, which makes maintaining them easier.
  • These counts each have an "all" member, which ticks up regardless of whether any of the booleans are set. The reason for this is so it doesn't look like we're getting no data, and to have an idea of what the maximum in that range would be (a rough sketch of this grouping follows below).
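A rough sketch of that grouping, expressed here with expvar maps since expvar is already in use (the group and member names are illustrative, not the actual metric names):

```go
package main

import "expvar"

var (
	counts = expvar.NewMap("counts")
	errs   = expvar.NewMap("errors")
)

// recordRequest shows the idea: "all" ticks on every request, the other
// members only when their boolean condition is set.
func recordRequest(s3Hit, hadError bool) {
	counts.Add("all", 1)
	if s3Hit {
		counts.Add("s3_hit", 1)
	}
	errs.Add("all", 1)
	if hadError {
		errs.Add("response_write", 1)
	}
}

func main() {
	recordRequest(true, false)
}
```

Keeping each group under a common prefix is what lets Grafana pick up the whole set at once.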

@zerebubuth, @iandees, @nvkelso: could you take a quick look and review whether it makes sense?
https://grafana.mapzen.com/dashboard/db/tiles-dev-us-east

iandees (Member) commented Jan 19, 2017

Looks pretty good to me. When I suggested the "all" counter, I meant it so you could generate graphs showing the percentage of all requests that had errors of some sort.

While experimenting with that, it looks like I may have changed it:

[screenshot: Grafana graph of the error percentage]

That doesn't quite look like what I would expect. It might be because traffic there is so low, but I wouldn't expect a value of 700+%. 🤔

nvkelso (Member) commented Jan 19, 2017

@rmarianski I dig the dashboard. We made a couple of changes to it together on the hangout to clarify the titles, so someone coming at this fresh can get up to speed faster.

There is one follow-up: measure the time it takes for the various operations involved in fetching the S3 zip asset, uncompressing it, and returning the requested format.
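A minimal sketch of what timing those stages could look like (the stage names and helper are placeholders, not the actual tapalcatl code):

```go
package main

import (
	"log"
	"time"
)

// timed runs fn and logs how long it took under the given stage name.
func timed(stage string, fn func()) {
	start := time.Now()
	fn()
	log.Printf("timing: %s took %s", stage, time.Since(start))
}

func main() {
	timed("s3_fetch", func() { /* fetch the zip asset from S3 */ })
	timed("unzip", func() { /* uncompress the archive */ })
	timed("format_response", func() { /* return the requested format */ })
}
```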

For posterity: it seems like Grafana samples on an interval, which distorts the numbers displayed for counts, but the overall trends look good. Maybe there's a setting for that (like the one we had to configure in CloudWatch).

And then we'll need to add these to the prod dashboard as well :)

iandees (Member) commented Jan 19, 2017

After chatting in person with Rob a bit I realized I should probably clarify my comment above. The weird percentage I noted is a result of looking at the wrong stats bucket.

Our apps write out counter increment messages, and statsd throws those increments into buckets with a configurable time width (I think the default is one second). On each flush, statsd empties each bucket and emits two values: (1) a stats-prefixed metric giving you a rate, the number in the bucket divided by the bucket width in time, and (2) a stats_counts-prefixed metric giving you the raw count of things in the bucket over that time.

In the case of my screenshot above, we were using the stats-prefixed rate value, so it was giving us unexpected numbers. When we switched to the stats_counts prefix, we got a count of requests, and the percentage made much more sense.

We also chatted a bit about the "all" metrics that Rob is emitting now. I clarified that what I meant was that we should emit a counter increment for every request, and then emit separate increments for the conditional or error cases. Each of those conditional/error cases doesn't need its own "all" counter, because the "every request" counter covers us (a rough sketch of this is below).
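A rough sketch of that approach over the plain statsd UDP line protocol (the metric names and address are made up; this is not tapalcatl's actual instrumentation):

```go
package main

import (
	"fmt"
	"net"
)

func main() {
	conn, err := net.Dial("udp", "127.0.0.1:8125")
	if err != nil {
		panic(err)
	}
	defer conn.Close()

	// A statsd counter increment is "<name>:1|c". On each flush, statsd
	// emits stats.<name> (the count divided by the flush interval, i.e. a
	// rate) and stats_counts.<name> (the raw count). The dashboard should
	// graph the stats_counts value to get request counts.
	increment := func(metric string) {
		fmt.Fprintf(conn, "%s:1|c", metric)
	}

	// Per request: one increment that always fires...
	increment("tapalcatl.requests")
	// ...and separate increments only for the conditional/error cases.
	s3Error := false // stand-in for a real condition
	if s3Error {
		increment("tapalcatl.errors.s3")
	}
}
```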

nvkelso (Member) commented Jan 20, 2017 via email

This issue was closed.