
Data doesn't seem to be graphing correctly... #32

Closed
jimrubenstein opened this issue Aug 17, 2011 · 15 comments

@jimrubenstein

I doubt this is relevant to the statsd daemon as much as to carbon/graphite, but I figure there's a high concentration of users running this stack here, so I thought I'd post it to see if anyone is having the same problem.

I'm incrementing several counters with statsd, and when I try to visualize/graph those counters, the graph only shows a fraction of the data I expect to see.

As an example, if I graph the integral values for a metric (named "processed") for the time period of today (midnight this am to midnight tonight, 8/17/11) the top value is currently about 225. However, if I grep the carbon updates.log file on my server, for today, there are around 4800 records saying carbon received updates for this value.

Am I missing something here? Should carbon be under-reporting this data so much? Am I doing something wrong? I'm using the PHP statsd client, and calling statsd::increment('entries'); to increment the metric. Has anyone experienced this same problem?
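
For reference, that call boils down to a single UDP datagram in the statsd counter format. Here's a rough Python equivalent, just to show the wire format (the host is an assumption; 8125 is statsd's default UDP port):

    import socket

    # Roughly what statsd::increment('entries') sends to the statsd daemon.
    sock = socket.socket(socket.AF_INET, socket.SOCK_DGRAM)
    sock.sendto(b"entries:1|c", ("127.0.0.1", 8125))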

Thanks!

@tabletcorry

Which metric are you looking at in graphite? Also, what are the contents of your storage-schemas.conf file?

I had an issue like this, and it was primarily caused by an incorrect whisper configuration.

@jimrubenstein

The metric name is processed, so it ends up in graphite as stats.processed. My storage-schemas.conf looks like this:

[everything]
priority = 100
pattern = .*
retentions = 60:43200,300:25920,3600:87600
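
(Decoding those retentions for reference - each entry is seconds-per-point:number-of-points:)

    60:43200     -> 1-minute points kept for 30 days
    300:25920    -> 5-minute points kept for 90 days
    3600:87600   -> 1-hour points kept for 10 years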

I'm trying to update the config now to reflect what the README.md says is being used; however, I'm getting errors about the storage schema not matching some metrics, so I'm trying to fix that too. Long sigh, heh.

@jimrubenstein

I didn't mean to close the issue, my bad.

@jimrubenstein

Fixed the problem with metrics not matching my schema, so my storage-schemas.conf looks like this now:

[stats]
priority = 110
pattern = ^stats\..*
retentions = 10:2160,60:10080,600:262974

[everything_1min_1day]
priority = 100
pattern = .*
retentions = 60:43200

which reflects the default in the README.md and the example in the graphite conf folder, storage-schemas.conf.example
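
(Those retentions decode to:)

    10:2160      -> 10-second points kept for 6 hours
    60:10080     -> 1-minute points kept for 7 days
    600:262974   -> 10-minute points kept for roughly 5 years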

I'm going to watch this for a few minutes and see if it looks accurate.

@tabletcorry

Yep, the issue is in your storage-schemas.conf... and the one in the README only kinda works (the regex doesn't match stats_counts)

Basically, the smallest retention must be 10 seconds (or whatever statsd is using as its sleep period). What you do after that doesn't matter, but the first one does.

Graphite is a very simple system, and if a new data point is in the same time period as a previous one... I think it just overwrites it. So, the first 5 statsd datasets (the first 50s of your 60s retention period) are being deleted by the final set. Finally, since statsd sends 0 if no increment operations occurred, it tends to wipe out data completely on occasion.

tl;dr: set your first retention period in carbon to 10s
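
To make that concrete, here's a toy Python model (made-up numbers) of six 10-second statsd flushes all landing in one 60-second whisper slot:

    # "Last write wins": six 10s flushes land in the same 60s slot.
    flush_counts = [40, 55, 0, 30, 25, 10]   # one made-up count per 10s flush

    slot = None
    for count in flush_counts:
        slot = count                          # each flush overwrites the slot

    print(sum(flush_counts))                  # 160 increments actually happened
    print(slot)                               # 10 is all the 60s slot keeps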

@jimrubenstein

I set the 10s retention, and things are starting to look a bit more accurate. It's only been about 7 minutes, so it's hard to tell.

After some reading, I see what you mean about "last update wins" in carbon. It's worth noting, however, that as of version 0.9.8 there is an aggregation daemon that looks like it can get around this limitation. We're running an older version, and it's kind of embarrassing that I didn't put much thought into the graphs not looking correct until very recently (the last day or two).

Might be worth updating the documentation to mention that the 10s retention period is required, or results will be inconsistent, heh.

Thanks for your help, tabletcorry - much appreciated!

@jimrubenstein

After adjusting the retention configurations and letting data collect overnight, it's apparent that the data being stored/recorded is not accurate. I don't really know what's going on. I do know that I had almost 10k increment calls to the entries metric last night, and when I got in this morning, graphite was showing a total of just over 900 as an integral value for that key.

After messing around with some arbitrary calls to statsd from command line scripts, it looks like there's some kind of issue with how statsd publishes data to graphite every 10 seconds. For now, I tweaked my storage schema to have a minimum resolution of 1 second, and modified my statsd config to flush every second. This yields data that is more in line with what's going on with my web-app. This is better, but I'm afraid that once my 1 day retention of 1 second intervals expires and the 10 second retention takes over, I'm going to have numbers that don't make sense again.
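
For reference, the flush change was just something like this in the statsd config (a sketch - the host and port values here are placeholders):

    {
      graphiteHost: "127.0.0.1",
      graphitePort: 2003,
      port: 8125,
      flushInterval: 1000   // flush every 1s instead of the default 10000ms
    }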

Clearly something is wrong here, or I'm expecting something from statsd/graphite that it's not meant to provide. I'm expecting to see a literal representation of the tracked metrics in graphite as they happened across my web cluster. Is that an unrealistic/incorrect expectation of what this stack is supposed to provide?

@tabletcorry

There are two metrics that should appear in graphite: stats and stats_counts.

stats is an average of the increments over the flush interval. Thus, this is usually much less than the number of increments.

stats_counts is a literal count of the increments, and should be accurate.

So, if you just look at stats_counts you should see what you want (without the modifications).

Edit: Removed my mistake on the stats average behavior.
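
To illustrate with made-up numbers (assuming statsd's default 10 second flush interval):

    # Hypothetical flush of an "entries" counter with a 10s flush interval.
    flush_interval_s = 10
    increments_this_flush = 25

    stats_value = increments_this_flush / flush_interval_s   # 2.5 (per-second average)
    stats_counts_value = increments_this_flush                # 25  (raw count)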

@yuvadm

yuvadm commented Aug 23, 2011

I'm seeing a similar issue (and I share Jim's proposition that it's not necessarily StatsD that is the culprit).

I boiled the case down to a single new bucket that I increment once. The newly added bucket shows up in whisper and in graphite, but the value never actually gets incremented. I'm using the same retention periods etsy uses:

[everything_etsy_style]
priority = 100
pattern = .*
retentions = 10:2160,60:10080,600:262974

@jimrubenstein

I was using an older version of the statsD daemon, before stats_counts had been added to the reporting.

What I've found supports what tabletcorry has said:

The stats bucket in graphite ends up being an "average of averages" value. This is because when statsD records data for the metrics, it increments them internally for 10 seconds. After a 10 second period, it takes the current value for each metric and divides it by 10 (to get the average events per second for that metric). It then reports that value to graphite, which stores the value for that 10 second period in the whisper database.

Now, according to graphite's documentation, after your minimum retention time in graphite, the next retention (60 seconds in your example) becomes an average of six 10 second interval values; so, 1 minute's worth of data divided by 6.

Documentation Excerpt:

When data moves from a higher precision to a lower precision, it is averaged.
This way, you can still find the total for a particular time period if you know the original precision.
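
A quick worked example of that roll-up, with made-up counts:

    # Six 10-second values being rolled up into one 60-second point.
    ten_second_counts = [12, 8, 0, 20, 4, 16]

    rolled_up = sum(ten_second_counts) / len(ten_second_counts)  # 10.0 is what gets stored
    recovered_total = rolled_up * 6                               # 60 = the original total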

Now, as tabletcorry mentioned, there is a stats_counts bucket in the newer versions of statsD. This bucket does not average the values over the 10 second flushing period, and is an accurate count of what happened in that 10 second period. This data is reported to graphite as this raw count, so these numbers will be an accurate representation of the events that happened for any 10 second interval for each metric.

However, you still have the problem that graphite presents, in that it averages the values for each metric as it moves to lower precision. To work around this, you'd have to lose the lower precision retentions and store a whole bunch of high precision values. So, to get 5 years retention of 10 second interval data, you'd need about 15.8 million data points (3,153,600 per year). I don't think it's realistic to keep 5 years of 10 second precision; I doubt many of us will be going back and using graphite for historical heuristic analysis - at least, not as far back as 5 years. I set my 10 second retention to 30 days.

tl;dr:

  • Old versions of statsD average your data over the 10s flush period
  • New versions report both the 10s average and the raw 10s counts
  • Graphite averages your data as it moves from high to low precision
  • Increase the high precision retention to get a longer, more accurate, history of your data
  • Upgrade statsD to make sure you have the stats and stats_counts buckets in graphite.

ps: gg github for not parsing all the markdown :

@jondot

jondot commented Dec 26, 2011

Sorry to reiterate on this; I've configured things to what looks like the best practice out of this conversation. I ran a script that posts a +1 increment every second, indefinitely.
This is the graph I'm getting
http://imageshack.us/f/37/statsvu.png/

The count is not increasing - is this the count for a 10s period of time? Does the count get reset every 10s?

Thanks

@jimrubenstein

@jondot what does your configuration file look like? That is really the key to how your graph ends up looking.

Also, which version of graphite are you using? I started a stack overflow topic about this a while back, and recently someone posted some new information regarding resolution-roll-up aggregation in newer versions of graphite. You can see here: http://stackoverflow.com/questions/7099197/tracking-metrics-using-statsd-via-etsy-and-graphite-graphite-graph-doesnt-se/8545821#8545821

@jondot

jondot commented Dec 28, 2011

Hey @jimrubenstein, I'm using Graphite 0.9.9 and the latest version of statsd, on the default config - reporting every 10s.
For Carbon storage, I've taken the recommended etsy configuration:

    [everything_etsy_style]
    priority = 100
    pattern = .*
    retentions = 10:2160,60:10080,600:262974

btw - you can see in the screenshot that my resolution is at 10s (so none of the 6-hour roll-up issues you bumped into)

@jimrubenstein

@jondot so, in your set-up what you're seeing is this:

StatsD only reports to graphite every 10 seconds. You may report every second to StatsD, but it sums up a 10 second period of reporting and sends that to graphite. Once a 10s period has been reported to graphite, you'll never see it increase. You'll only ever see data arrive in graphite at a 10 second interval, since that's as fast as StatsD reports it.

So, the answer to your question is: yes. The count is only for that 10 second period, and it gets reset every 10 seconds. You can use graphite functions to sum up all the 10s interval reports over whatever period of time you specify. Concretely, you can do a sum over 1 minute and it will add up all the 10 second reports in that minute; you'll see the count increase over a 1 minute period and then reset for the next minute.
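
For example, something like this Graphite target should give per-minute totals of the 10s counts (assuming your counter shows up as stats_counts.entries; summarize() sums by default):

    summarize(stats_counts.entries, "1min")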

Does that make sense?

@jondot

jondot commented Dec 28, 2011

Thanks @jimrubenstein, it does make sense. I didn't know Graphite worked that way; it's kinda hard finding the right documentation. Thanks!
