
Prevent hanging when sample collection time > interval #124

Open · wants to merge 20 commits into base: master
Conversation

@benbuckman (Contributor):

This attempts to fix the bug described in #123: when the interval is shorter than the time required to record and analyze stack frames, the job queue piles up faster than it is flushed and the program hangs.

Pieces:

  • Moves the `in_signal_handler` lock to the global `_stackprof` struct, and adds a check and escape hatch in `stackprof_signal_handler`.
  • Adds a test which demonstrates the hanging behavior and the fix for it.
  • Adds helpers timeval_to_usec and diff_timevals_usec for working with time durations more easily as long ints of microseconds, rather than timeval structs.
  • Adds a debug mode with stdout+stderr output recording the internal timings and flow.
  • Adds some comments and renames some variables for clarity.

More context on the underlying bug, and consideration of alternative approaches, is at #123.

Motivation:
We would like to use StackProf on a large-scale Ruby application in production. The inability to set an interval that is both fast enough to produce useful data, and guaranteed to never hang the program, is a blocker to doing so.

Caveats:

  • I am an amateur with C. It took a lot of trial and error with this code and https://repl.it/languages/c to get this working.
  • I have not yet analyzed what the escape hatch in stackprof_signal_handler does to the quality of the final results. (At the end of the day, it's a kind of user or program error if the interval is so fast that the escape hatch is necessary. The question is how to deal with it.)
  • I have mostly tested this in wall mode; I have not yet analyzed how this bug or the fix behave in cpu or object modes (though I suspect they're similar).

cc @NickLaMuro, thank you for your engagement with #123!

```c
/*
Records information about the frames captured in `_stackprof.frames_buffer`,
up to `frame_count-1` (the buffer may contain more frames from a prior sample),
captured `timestamp_delta` microseconds after the previous sample.
*/
```
@benbuckman (Contributor Author), Nov 7, 2019:

(This is based on my understanding of the code, please correct if I'm misunderstanding. I'm also not sure what the canonical comment format should be here.)

```diff
-    if (_stackprof.raw) {
-        gettimeofday(&_stackprof.last_sample_at, NULL);
-    }
+    gettimeofday(&_stackprof.last_sample_at, NULL);
```
@benbuckman (Contributor Author):

(It seems simpler, and trivial, to assign this always, rather than track the downstream paths that do or don't need it.)

```c
struct timeval sampling_start, sampling_finish;

gettimeofday(&sampling_start, NULL);
long int time_since_last_sample_usec = diff_timevals_usec(&_stackprof.last_sample_at, &sampling_start);
```
@benbuckman (Contributor Author):

(Note, this replaces the timestamp math fix in #122.)

@tmm1 (Owner):

Instead of long int, include stdint.h and use int64_t

```c
/* helpers quoted from the diff; the body of diff_timevals_usec is
   completed here for context */
long int timeval_to_usec(struct timeval *diff) {
    return MICROSECONDS_IN_SECOND * diff->tv_sec + diff->tv_usec;
}

long int diff_timevals_usec(struct timeval *start, struct timeval *end) {
    return timeval_to_usec(end) - timeval_to_usec(start);
}
```

@benbuckman (Contributor Author):

@tmm1 Thanks for your feedback so far.
What do you think of this overall?
How would you like StackProf to handle the scenario described here – should it skip the signal handler (as currently implemented), fill in fake frames, or stop profiling and raise an error (e.g. `IntervalTooFast`) later, on `StackProf.results` or at the end of `run`?

@benbuckman (Contributor Author):

Thinking about it some more, another option would be to track and expose `_stackprof.skipped_frames`, and let the program use it after collecting the data, if it cares to.

@benbuckman commented Nov 11, 2019:

> `_stackprof.skipped_frames`

I see an integer `missed_samples` in the results, defined at

```c
S(missed_samples);
```

but it's never set in the C code, so it's always 0. Was the original intention for `missed_samples` the same as what I'm proposing here?

Never mind, the search results missed where it's used:

```c
rb_hash_aset(results, sym_missed_samples, SIZET2NUM(_stackprof.overall_signals - _stackprof.overall_samples));
```

It looks like `missed_samples`, as already implemented, would correctly count the number of signals skipped by the new escape hatch. Excellent.

@benbuckman benbuckman changed the title [WIP] Prevent hanging when sample collection time > interval Prevent hanging when sample collection time > interval Nov 11, 2019
@benbuckman benbuckman requested a review from tmm1 November 11, 2019 17:52
@benbuckman (Contributor Author):

@tmm1 || @tenderlove || other maintainers, do you have time to review this?
Thank you!

Commit messages on this branch:

- "causes segfaults in tests 😞 / This reverts commit ac49055."
- "in `fprintf`s: `_stackprof.interval` is an int; `timestamp_delta` as `int64_t` – but doing the same with `_stackprof.raw_timestamp_deltas` makes it segfault!"
@benbuckman commented Nov 11, 2019:

@tmm1 I'm using int64_t now as suggested, but it's inconsistent – _stackprof.raw_timestamp_deltas still holds ints – and attempting to change that ... causes segfaults in tests (https://travis-ci.org/tmm1/stackprof/builds/610510787).

Edit: this is fixed, I needed to also change the malloc size

It appears that at 1μs, the performance hit of profiling causes the test to take >10s to run and time out; try 10μs instead.

```c
if (_stackprof.debug) {
    printf("started with interval %d (%ld sec %d usec)\n",
           NUM2INT(interval), timer.it_interval.tv_sec, timer.it_interval.tv_usec);
}
```
@benbuckman (Contributor Author):

(This highlights the separate issue of #125)

```c
if (raw) {
    gettimeofday(&_stackprof.last_sample_at, NULL);
    if (_stackprof.debug) {
        printf("started with interval %d (%ld sec %d usec)\n",
               NUM2INT(interval), timer.it_interval.tv_sec, timer.it_interval.tv_usec);
    }
}
```
@benbuckman (Contributor Author):

(This highlights the separate issue of #125)
