Logging and metrics #4
I explored using …
I added a Grafana dashboard for dev as well, and put the relevant tapalcatl panels in there. Of note:
@zerebubuth, @iandees, @nvkelso could you take a quick look and review whether it makes sense?
Looks pretty good to me. When I suggested the "all", I meant it so you could generate graphs that show the percentage of all requests that had errors of some sort. While experimenting with that, it looks like I may have changed it:

[screenshot omitted]

That doesn't quite look like I would expect it to. It might be because the traffic is so low there, but I wouldn't expect a 700+% value. 🤔
@rmarianski I dig the dashboard. We made a couple of changes to it together on the hangout to clarify the titles so someone coming at this fresh could get up to speed faster. There is one follow-up: measure the time it takes for the various operations involved in getting the S3 zip asset, uncompressing it, and returning the requested format. For posterity: it seems like Grafana samples on an interval, which distorts the numbers displayed for counts, but the overall trends look good. Maybe there's a setting for that (like we've had to configure in CloudWatch). And then we'll need to add these to the prod dashboard as well :)
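To make that timing follow-up concrete, here is a minimal sketch of timing the individual stages and reporting each as a statsd timer over statsd's plain UDP wire format (`<bucket>:<millis>|ms`). The metric names, the placeholder stage bodies, and the statsd address are all assumptions for illustration, not tapalcatl's actual code.

```go
// Hypothetical sketch: time distinct request stages and emit statsd timers
// over UDP. Metric names and the statsd address are assumptions.
package main

import (
	"fmt"
	"net"
	"time"
)

// timed runs fn and reports its duration as a statsd timer on bucket.
func timed(conn net.Conn, bucket string, fn func()) {
	start := time.Now()
	fn()
	fmt.Fprintf(conn, "%s:%d|ms", bucket, int64(time.Since(start)/time.Millisecond))
}

func main() {
	conn, err := net.Dial("udp", "127.0.0.1:8125") // assumed statsd address
	if err != nil {
		panic(err)
	}
	defer conn.Close()

	// Placeholder bodies stand in for the real work.
	timed(conn, "tapalcatl.s3.fetch", func() { /* download the zip asset from S3 */ })
	timed(conn, "tapalcatl.zip.extract", func() { /* uncompress the requested tile */ })
	timed(conn, "tapalcatl.format.encode", func() { /* return the requested format */ })
}
```

With timers like these in place, the per-stage panels can sit alongside the existing count panels on the dev dashboard.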
After chatting in person with Rob a bit I realized I should probably clarify my comment above. The weird percentage I noted is a result of looking at the wrong stats bucket.

Our apps write out counter increment messages, and statsd throws those increments into buckets with some configurable time width. I think the default is one second. Every second, statsd flushes each bucket and emits two values: (1) the `stats`-prefixed metric, giving you a *rate*: the number in the bucket divided by the bucket width (in time), and (2) the `stats_counts`-prefixed metric, which gives you the *count* of things in the bucket over that time.

In the case of my screenshot above, we were using the `stats`-prefixed *rate* value, so it was giving us unexpected numbers. When we switched to the `stats_counts` prefix, we got a count of requests and the percentage made much more sense.

We also chatted a bit about the "all" metrics that Rob is emitting now. I clarified that what I meant was we should emit a counter increment for every request and then emit separate increments for the conditional or error cases. Each of those conditional/error cases doesn't need its own "all" counter because the "every request" counter will cover us.
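A minimal sketch of that counter pattern, assuming statsd's plain UDP wire format (`<bucket>:1|c` for a counter increment); the metric names, handler shape, and `lookupTile` stand-in are hypothetical, not tapalcatl's actual code.

```go
// Hypothetical sketch of the "count every request, plus separate counters for
// the error cases" pattern, using statsd's UDP wire format directly.
// Metric names and the statsd address are assumptions.
package main

import (
	"fmt"
	"net"
	"net/http"
)

var statsd net.Conn

// inc sends a single statsd counter increment ("<bucket>:1|c").
func inc(bucket string) {
	if statsd != nil {
		fmt.Fprintf(statsd, "%s:1|c", bucket)
	}
}

func tileHandler(w http.ResponseWriter, r *http.Request) {
	inc("tapalcatl.requests.all") // every request, unconditionally

	tile, found, err := lookupTile(r) // hypothetical storage lookup
	switch {
	case err != nil:
		inc("tapalcatl.requests.error") // only the error case
		http.Error(w, "internal error", http.StatusInternalServerError)
	case !found:
		inc("tapalcatl.requests.miss") // only the not-found case
		http.NotFound(w, r)
	default:
		w.Write(tile)
	}
}

// lookupTile is a stand-in for the real S3/zip lookup.
func lookupTile(r *http.Request) ([]byte, bool, error) {
	return []byte("tile bytes"), true, nil
}

func main() {
	var err error
	statsd, err = net.Dial("udp", "127.0.0.1:8125") // assumed statsd address
	if err != nil {
		panic(err)
	}
	http.HandleFunc("/", tileHandler)
	http.ListenAndServe(":8080", nil)
}
```

With counters emitted this way, the percentage panel discussed earlier can divide the error counter's `stats_counts` series by the all-requests series.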
Were the Grafana charts updated to reflect the `stats_counts` prefix?
Related to #3, we'll want to be able to track whether tapalcatl was able to satisfy the request from S3 or not. If tapalcatl will continue to act as a proxy for tileserver, it will need to expose this information. If not, then the Fastly config should get updated accordingly.
Additionally, we should think about logging anything that might be useful to capture for metrics/analytics/alerting purposes. If it's something that we want to capture in Redshift for longer-term storage, it's easiest to stick the relevant information in a response header and have the Fastly log parsing processes handle it.
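As a sketch of that response-header approach, here is one way it could look, assuming a hypothetical `X-Tapalcatl-Source` header name and stand-in lookup/proxy functions; none of this is tapalcatl's actual API.

```go
// Hypothetical sketch: record whether the tile came from the S3 zip archive or
// from the tileserver proxy in a response header, so Fastly's log parsing can
// pick it up. The header name and handler shape are assumptions.
package main

import "net/http"

func tileHandler(w http.ResponseWriter, r *http.Request) {
	tile, ok := readTileFromS3(r) // hypothetical S3 zip lookup
	if ok {
		w.Header().Set("X-Tapalcatl-Source", "s3")
		w.Write(tile)
		return
	}
	// Fall back to proxying tileserver, and say so in the header.
	w.Header().Set("X-Tapalcatl-Source", "tileserver-proxy")
	proxyToTileserver(w, r)
}

// Stand-ins for the real lookup and proxy logic.
func readTileFromS3(r *http.Request) ([]byte, bool) { return nil, false }

func proxyToTileserver(w http.ResponseWriter, r *http.Request) {
	http.Error(w, "not implemented in this sketch", http.StatusBadGateway)
}

func main() {
	http.HandleFunc("/", tileHandler)
	http.ListenAndServe(":8080", nil)
}
```

Fastly's logging config could then emit that header value alongside the usual request fields, which also covers the S3 hit/miss tracking mentioned above.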