90%: Tripod stats factory #94

Merged Sep 18, 2015 (21 commits)

@rsinger rsinger commented Sep 4, 2015

This builds on the work from #70 but takes a somewhat different approach to how Stats objects are passed around.

This maintains the current Tripod pattern of passing a Stats object to Tripod\Mongo\Driver as part of the constructor, then passes enough stats config for the background jobs to instantiate the same stats objects from a factory class.
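As a rough illustration of the idea described above (not the code in this PR), the round trip could look something like the sketch below. Every name here (`StatSketch`, `StatsDSketch`, `NoStatSketch`, `StatsFactorySketch`, `getStatsConfig()`) is hypothetical, and the shape of the config array is an assumption.

```php
<?php
// Hypothetical names throughout; this is not Tripod's actual API.

interface StatSketch
{
    /** @return array Plain-array config that can travel in a job payload. */
    public function getStatsConfig();
}

class NoStatSketch implements StatSketch
{
    public function getStatsConfig()
    {
        return array('class' => 'NoStatSketch', 'config' => array());
    }
}

class StatsDSketch implements StatSketch
{
    public $host;
    public $port;

    public function __construct($host, $port)
    {
        $this->host = $host;
        $this->port = (int) $port;
    }

    public function getStatsConfig()
    {
        return array(
            'class'  => 'StatsDSketch',
            'config' => array('host' => $this->host, 'port' => $this->port),
        );
    }
}

class StatsFactorySketch
{
    /** Rebuild a stats object from the array produced by getStatsConfig(). */
    public static function create(array $statsConfig)
    {
        $class  = isset($statsConfig['class']) ? $statsConfig['class'] : 'NoStatSketch';
        $config = isset($statsConfig['config']) ? $statsConfig['config'] : array();

        if ($class === 'StatsDSketch' && isset($config['host'], $config['port'])) {
            return new StatsDSketch($config['host'], $config['port']);
        }
        // Unknown or incomplete config: fall back to a no-op stats object.
        return new NoStatSketch();
    }
}

// Driver side: only the plain config is placed in the job arguments.
$driverStat = new StatsDSketch('localhost', 8125);
$jobArgs    = json_decode(json_encode(array('statsConfig' => $driverStat->getStatsConfig())), true);

// Job side: the worker rebuilds an equivalent stats object from that config.
$jobStat = StatsFactorySketch::create($jobArgs['statsConfig']);
```

The point of the sketch is that only a serializable array crosses the queue boundary, so the background job never needs the original object.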

@rsinger rsinger changed the title 0%: Tripod stats factory spike 10%: Tripod stats factory spike Sep 4, 2015
@rsinger rsinger changed the title 10%: Tripod stats factory spike 30%: Tripod stats factory spike Sep 8, 2015
rsinger commented Sep 8, 2015

Some questions that are probably best answered by @RobotRobot, which would be relevant to either this PR or #70:

  • We currently aren't keeping track of how many subjects are in a given DiscoverImpactedSubjects or ApplyOperation job: this seems like a significant omission, since we'd have no idea if a very long running job was 1 subject or 1,000.
  • The StatsD class doesn't keep track of 'store' or 'pod', so we have no granularity if some databases/collections are incredibly slow or fast or whatever.
  • Do we want to track the timing of each individual operation by operation type (e.g. tables, view, search) in ApplyOperation?

* @var array The original read preference gets stored here
* when changing for a write.
*/
private $originalCollectionReadPreference = array();
Contributor

Why have these been removed?

Member Author

That's from #70

Member Author

This branch was branched off of it.

Member Author

I don't actually see any indication that they were referenced anywhere. It looks like they might have been intended to be what the same properties on Updates.class.php became.

Contributor

Ah ok

rsinger commented Sep 9, 2015

Added a lot of new stats calls: I noticed that we really only called increment() when operations failed (since I guess timing doesn't really mean much in that case), but I feel that might be an oversight. It seems as important to record the frequency of an operation as the length of time it takes to perform. It's possible that this is unnecessary, i.e. maybe graphite already does this for us simply with the timer stats, but this is definitely something where I'd need to defer to @RobotRobot and @foomatty for guidance.

I also added a 'custom' method to ITripodStat which allows the dev to keep arbitrary stats besides counts and timers. StatsD has 'gauges', so for that class, this custom method just uses that. With this, I've added stats for how many subjects are in an ApplyOperation job and how many impacted subjects are found in a DiscoverImpactedSubjects job.

I don't really know at what threshold these stats become noise (or problematic to graphite), but I figure I can add a lot in the PR (where we can see them all) and rip them out after discussion if need be.

rsinger commented Sep 9, 2015

Ah, I noticed that for StatsD->timer(), we're also sending an increment ("1|c","$duration|ms"), so I guess those added increments are unnecessary.
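To make that concrete, here is a minimal sketch of the two messages a timer call would emit, assuming the usual StatsD line format of `bucket:value|type`. The function name `statsdTimerPayloadsSketch` is hypothetical, and the real class may build its bucket names differently (e.g. with a prefix).

```php
<?php
// Hypothetical helper: shows why a separate increment() alongside timer()
// would be redundant if timer() already sends a counter.
function statsdTimerPayloadsSketch($operation, $durationMs)
{
    return array(
        $operation . ':1|c',                    // counter: one occurrence of the operation
        $operation . ':' . $durationMs . '|ms', // timer: how long that occurrence took
    );
}

$payloads = statsdTimerPayloadsSketch('MONGO_CREATE_SEARCH_DOC', 320);
// $payloads is array('MONGO_CREATE_SEARCH_DOC:1|c', 'MONGO_CREATE_SEARCH_DOC:320|ms')
```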

@rsinger rsinger changed the title 30%: Tripod stats factory spike 70%: Tripod stats factory Sep 11, 2015
@rsinger rsinger changed the title 70%: Tripod stats factory 90%: Tripod stats factory Sep 11, 2015
define('MONGO_CREATE_SEARCH_DOC','MONGO_CREATE_SEARCH_DOC');
define('MONGO_CONNECTION_ERROR','MONGO_CONNECTION_ERROR');

define('STAT_TYPE_COUNT', 'count');
Contributor

I would rename this to SUBJECT_COUNT, that's what you're counting.

@scaleupcto

As discussed, I think the use of gauges is wrong. Gauges measure a value in a single context, e.g. the speed of a single car or the %load of a CPU, at a single point in time.

So what you have here is analogous to the number of subjects being processed at any one time. The last job to post a value will overwrite any other jobs running in parallel in a given time window.

Instead I think we should simplify this to be able to increment by n (default 1). That way each parallel running job just contributes to the value by incrementing it in a given minute, not overwriting the value of others.
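A small in-memory sketch of the difference, assuming two parallel jobs report within the same flush window. `FlushWindowSketch` is a hypothetical stand-in for how a StatsD-style aggregator treats the two metric types; it is not a real client.

```php
<?php
// Hypothetical aggregator: gauges keep the last value written,
// counters sum every contribution within the window.
class FlushWindowSketch
{
    public $gauges = array();
    public $counters = array();

    public function gauge($key, $value)
    {
        $this->gauges[$key] = $value; // overwrites whatever another job reported
    }

    public function increment($key, $n = 1)
    {
        if (!isset($this->counters[$key])) {
            $this->counters[$key] = 0;
        }
        $this->counters[$key] += $n; // each job adds to the running total
    }
}

$window = new FlushWindowSketch();
$key    = 'MONGO_QUEUE_APPLY_OPERATION_JOB.subject_count';

// Two jobs running in parallel: one with 1,000 subjects, one with 40.
foreach (array(1000, 40) as $subjectCount) {
    $window->gauge($key, $subjectCount);
    $window->increment($key, $subjectCount);
}

// $window->gauges[$key]   is 40   (the 1,000-subject job is lost)
// $window->counters[$key] is 1040 (both jobs are counted)
```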

rsinger commented Sep 18, 2015

New stats added by the branch:

  • MONGO_VIEW_CACHE_MISS: an increment of requests for a View that needed to be regenerated from CBD
  • MONGO_TABLE_ROWS_DISTINCT: an increment of Tables->distinct() calls
  • MONGO_QUEUE_DISCOVER_JOB.subjects_count: keeps track of how many subjects are identified in a single DiscoverImpactedSubjects job
  • MONGO_QUEUE_DISCOVER_SUBJECT: an increment of how many subjects total are impacted, also a timer
  • MONGO_QUEUE_DISCOVER_FAIL: an increment of failed discover jobs
  • MONGO_QUEUE_APPLY_OPERATION.[operation_type]: an increment for each specific apply operation run per subject (e.g. generate_table_rows, generate_views, search), also a timer
  • MONGO_QUEUE_APPLY_OPERATION_JOB.subject_count: keeps track of how many subjects are sent to a single ApplyOperation job
  • MONGO_QUEUE_APPLY_OPERATION_FAIL: an increment of failed apply jobs

rsinger added a commit that referenced this pull request Sep 18, 2015
@rsinger rsinger merged commit a632d3a into master Sep 18, 2015
@rsinger rsinger deleted the tripod-stats-factory-spike branch September 18, 2015 17:06