
Logging and metrics #4

Closed

rmarianski opened this issue Dec 29, 2016 · 6 comments

rmarianski (Member) commented Dec 29, 2016

Related to #3, we'll want to be able to track whether tapalcatl was able to satisfy the request from S3 or not. If tapalcatl will continue to act as a proxy for tileserver, it will need to expose this information; if not, then the Fastly config should be updated accordingly.

Additionally, we should think about logging anything that might be useful to capture for metrics/analytics/alerting purposes. If it's something that we want to capture in Redshift for longer-term storage, the easiest approach is to stick the relevant information in a response header and have the Fastly log parsing processes handle it.
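For illustration, a minimal sketch of what that could look like; the helper functions and the X-Tapalcatl-Hit header name are made up here, not tapalcatl's actual code:

```go
package main

import (
	"fmt"
	"net/http"
)

// fetchFromS3 stands in for whatever S3 lookup tapalcatl performs.
func fetchFromS3(path string) ([]byte, bool) {
	return nil, false // pretend it's always a miss in this sketch
}

// proxyToTileserver stands in for the fallback proxy behaviour.
func proxyToTileserver(w http.ResponseWriter, r *http.Request) {
	fmt.Fprintln(w, "proxied response")
}

func tileHandler(w http.ResponseWriter, r *http.Request) {
	if body, ok := fetchFromS3(r.URL.Path); ok {
		// Record the outcome in a header so Fastly's log parsing can capture it.
		w.Header().Set("X-Tapalcatl-Hit", "s3")
		w.Write(body)
		return
	}
	w.Header().Set("X-Tapalcatl-Hit", "tileserver-proxy")
	proxyToTileserver(w, r)
}

func main() {
	http.HandleFunc("/", tileHandler)
	http.ListenAndServe(":8080", nil)
}
```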

rmarianski (Member, Author)

I explored using expvar for this because it's already used throughout, but it seems like it's meant just for printing out debug stats from a JSON endpoint. Since the docs say the strings it prints are JSON, we could probably unmarshal them, but I'm on the fence as to whether that's better than doing something else in addition to expvar.
Anyway, to complete the initial thought: I was thinking of having a goroutine run on a set interval, e.g. every minute, and log a summary of the activity in that period. Then we'd have something pick up that log entry and have it end up in a metric.
But maybe it's better to just log the individual entries themselves and have any aggregation happen externally. That's probably the better option, because changing aggregation strategies could then happen outside the code itself.
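As a rough sketch of the interval-summary idea (illustrative only; the counter names and one-minute interval are assumptions, not tapalcatl's actual code):

```go
package main

import (
	"log"
	"sync/atomic"
	"time"
)

var (
	s3Hits   uint64
	s3Misses uint64
)

// The request handlers would call these as requests are served.
func recordHit()  { atomic.AddUint64(&s3Hits, 1) }
func recordMiss() { atomic.AddUint64(&s3Misses, 1) }

// logSummary logs and resets the counters once per interval.
func logSummary(interval time.Duration) {
	for range time.Tick(interval) {
		hits := atomic.SwapUint64(&s3Hits, 0)
		misses := atomic.SwapUint64(&s3Misses, 0)
		log.Printf("summary: s3_hits=%d s3_misses=%d interval=%s", hits, misses, interval)
	}
}

func main() {
	go logSummary(time.Minute)
	// ... serve requests here, calling recordHit/recordMiss as appropriate ...
	recordHit()
	recordMiss()
	select {}
}
```

Logging individual entries instead, as suggested above, just moves this aggregation out of the process and into whatever consumes the log.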

rmarianski (Member, Author)

I added a Grafana dashboard for dev as well and put the relevant tapalcatl panels in there. Of note:

  • I grouped the boolean counters into "counts" and "errors" (both in the graphs and in code) because I think it makes sense to view them that way. Grafana has a handy way of including all metrics that are grouped together, which makes maintaining them easier.
  • These counts each have an "all" member, which ticks up regardless of whether any of the booleans are set. The reason for this is so it doesn't look like we're getting no data, and to have an idea of what the maximum in that range would be (a rough sketch of this grouping follows below).
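A rough sketch of that grouping, expressed here with expvar maps since expvar is already in use (the group and member names are illustrative, not the actual metric names):

```go
package main

import "expvar"

var (
	counts = expvar.NewMap("counts")
	errs   = expvar.NewMap("errors")
)

// recordRequest shows the idea: "all" ticks on every request, the other
// members only when their boolean condition is set.
func recordRequest(s3Hit, hadError bool) {
	counts.Add("all", 1)
	if s3Hit {
		counts.Add("s3_hit", 1)
	}
	errs.Add("all", 1)
	if hadError {
		errs.Add("response_write", 1)
	}
}

func main() {
	recordRequest(true, false)
}
```

Keeping each group under a common prefix is what lets Grafana pick up the whole set at once.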

@zerebubuth, @iandees, @nvkelso: could you take a quick look and review whether it makes sense?
https://grafana.mapzen.com/dashboard/db/tiles-dev-us-east

iandees (Member) commented Jan 19, 2017

Looks pretty good to me. When I suggested the "all" counter, I meant it so you could generate graphs showing the percentage of all requests that had errors of some sort.

While experimenting with that, it looks like I may have changed it:

[screenshot: Grafana graph of the error percentage]

That doesn't quite look like what I would expect. It might be because traffic there is so low, but I wouldn't expect a value of 700+%. 🤔

nvkelso (Member) commented Jan 19, 2017

@rmarianski I dig the dashboard. We made a couple of changes to it together on the hangout to clarify the titles, so someone coming at this fresh can get up to speed faster.

There is one follow-up: measure the time it takes for the various operations involved in fetching the S3 zip asset, uncompressing it, and returning the requested format.
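A minimal sketch of what timing those stages could look like (the stage names and helper are placeholders, not the actual tapalcatl code):

```go
package main

import (
	"log"
	"time"
)

// timed runs fn and logs how long it took under the given stage name.
func timed(stage string, fn func()) {
	start := time.Now()
	fn()
	log.Printf("timing: %s took %s", stage, time.Since(start))
}

func main() {
	timed("s3_fetch", func() { /* fetch the zip asset from S3 */ })
	timed("unzip", func() { /* uncompress the archive */ })
	timed("format_response", func() { /* return the requested format */ })
}
```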

For posterity: it seems like Grafana samples on an interval, which distorts the numbers displayed for counts, but the overall trends look good. Maybe there's a setting for that (like the one we had to configure in CloudWatch).

And then we'll need to add these to the prod dashboard as well :)

iandees (Member) commented Jan 19, 2017

After chatting in person with Rob a bit I realized I should probably clarify my comment above. The weird percentage I noted is a result of looking at the wrong stats bucket.

Our apps write out counter increment messages, and statsd throws those increments into buckets with a configurable time width (I think the default is one second). On each flush, statsd empties each bucket and emits two values: (1) a stats-prefixed metric giving you a rate, the number in the bucket divided by the bucket width in time, and (2) a stats_counts-prefixed metric giving you the raw count of things in the bucket over that time.

In the case of my screenshot above, we were using the stats-prefixed rate value, so it was giving us unexpected numbers. When we switched to the stats_counts prefix, we got a count of requests, and the percentage made much more sense.

We also chatted a bit about the "all" metrics that Rob is emitting now. I clarified that what I meant was that we should emit a counter increment for every request, and then emit separate increments for the conditional or error cases. Each of those conditional/error cases doesn't need its own "all" counter, because the "every request" counter covers us (a rough sketch of this is below).
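A rough sketch of that approach over the plain statsd UDP line protocol (the metric names and address are made up; this is not tapalcatl's actual instrumentation):

```go
package main

import (
	"fmt"
	"net"
)

func main() {
	conn, err := net.Dial("udp", "127.0.0.1:8125")
	if err != nil {
		panic(err)
	}
	defer conn.Close()

	// A statsd counter increment is "<name>:1|c". On each flush, statsd
	// emits stats.<name> (the count divided by the flush interval, i.e. a
	// rate) and stats_counts.<name> (the raw count). The dashboard should
	// graph the stats_counts value to get request counts.
	increment := func(metric string) {
		fmt.Fprintf(conn, "%s:1|c", metric)
	}

	// Per request: one increment that always fires...
	increment("tapalcatl.requests")
	// ...and separate increments only for the conditional/error cases.
	s3Error := false // stand-in for a real condition
	if s3Error {
		increment("tapalcatl.errors.s3")
	}
}
```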

nvkelso (Member) commented Jan 20, 2017 via email

This issue was closed.