-
Notifications
You must be signed in to change notification settings - Fork 135
Add fake stack frames representing samples taken during garbage collection #87
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Conversation
|
I'm not totally sure how to test this properly -- the test I updated passes, but there are tests that fail before and after my changes for me. I'm running Here are the tests that fail for me: There are also various ways in which the code seems to crash with my changes: and |
ext/stackprof/stackprof.c
Outdated
| #undef S | ||
|
|
||
| gc_hook = Data_Wrap_Struct(rb_cObject, stackprof_gc_mark, NULL, &_stackprof); | ||
| _stackprof.fake_gc_frame = rb_str_new_literal("fake_gc_frame"); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
You need to mark this as a global. It's getting GC'd otherwise.
ext/stackprof/stackprof.c
Outdated
| rb_hash_aset(details, sym_name, name); | ||
| if (frame == _stackprof.fake_gc_frame) { | ||
| name = rb_str_new_literal("(garbage collection)"); | ||
| file = rb_str_new_literal(""); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Might be worth it to create these two strings once upfront and mark them as globals to re-use as well.
|
Feedback should be addressed! Thanks for the super fast turnaround time! |
| gc_hook = Data_Wrap_Struct(rb_cObject, stackprof_gc_mark, NULL, &_stackprof); | ||
|
|
||
| _stackprof.fake_gc_frame = rb_str_new_literal("fake_gc_frame"); | ||
| rb_global_variable(&gc_hook); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Did this get lost?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Indeed! Will fix.
tmm1
left a comment
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Did that fix your crashes?
|
Still getting this on every 6th run or so, even after restoring the accidentally deleted line retaining Here's the c backtrace: Another run that crashed yielded this stack trace: |
|
Getting the line number in stackprof_results where the crash is occurring would tell you what's going on. You'll want to compile the extension with CFLAGS="-ggdb -O0" and then run under lldb to catch the crash, then |
|
Okay, I got a little bit farther. It took me a while to figure out how to run the tests under lldb, but I got it eventually by using It looks like malloc is failing somehow? Here's the backtrace (at least for one of the classes of crashes). I'm also not sure I set the debug flags correctly. To do that part, I made the following change: |
|
It looks like there are at least 2 classes of failure. Here's a different one: |
|
That looks like classic signs of memory corruption, because some ruby object is not getting marked correctly. |
| } | ||
|
|
||
| if (_stackprof.raw_samples_capa <= _stackprof.raw_samples_len + num) { | ||
| while (_stackprof.raw_samples_capa <= _stackprof.raw_samples_len + (num + 2)) { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Okay! I think this fixes it.
I believe this code was previously unsafe.
This is checking to see if the capacity of raw_samples is sufficient to cover num new entries, but it really needs to cover up to num + 2 new entries, since each sample entry is
stack height, followed by n frame ids, followed by the sample count.
I think this was an issue even before this change, but I'm not able to reproduce a crash before this PR.
Regardless, I think this is safer now?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I'm not sure I follow.. why would this need to be a loop?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I see, because we just double every time.
Calling realloc multiple times seems silly. It should calculate the required size once, and then call realloc once.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
In practice, I suspect this will almost never need to call realloc more than once, because we set the initial buffer size to be 100 * the number of frames in the first sample, and for this to be called more than once, a given stack would have to be more than the size of all previous stacks combined. The while loop is just for safety, but I suspect in practice it'll never be a significant performance burden.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Okay fair enough.
|
I see some random failures locally like: These should probably be attributed to the GC instead of counting as missed. |
ext/stackprof/stackprof.c
Outdated
| rb_postponed_job_register_one(0, stackprof_job_handler, 0); | ||
| if (rb_during_gc()) { | ||
| _stackprof.during_gc++; | ||
| rb_postponed_job_register_one(0, stackprof_job_handler, (void*)1); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Good catch. The list of signal-safe functions is very small, so calling into rb_hash_aset etc from the signal handler was not going to work reliably.
I guess it depends on what the miss count is meant to represent. It seems like you'd want to have the gc sample count to accurately reflect the number of sample in |
|
I'm happy to make the sample counts correspond however you'd like -- just need to understand what |
|
rb_postponed_job_register_one will not queue up more than one job for the same function callback. So when the signal fires multiple times during a GC, the job_handler fires only once and adds a single GC frame. This means the extra signals are considered "missed", even though we technically know they all happened during GC. I think it makes sense to use distinct job handlers for gc and regular callbacks, so they don't cancel each other out. And it seems we should use another counter like "current_gc_samples" that keeps track of how much time the current GC iteration took, then use that when the job finally fires and reset it at the end. Does that make sense? |
|
Yep, that makes sense! Thanks for the clear explanation |
|
Diff looks great. Should I merge it? |
|
Well it looks like CI is failing? Do I need to address the failures there? Seems like the code also yields some C90 warnings? |
|
Looks like |
|
That's probably worth fixing. I just turned the CI on so I don't think anything other than 2.2 was passing before either. |
You can detect if its available via extconf, and then add a wrapper using ifndef |
|
Well, that seems to have fixed the compilation failures. There are still test failures I'm not too sure what to do about. Some of them look like they're just differences in implementation, and some of them look like they probably randomly pass or succeed. Is there anything you'd like me to do to address that? I think some of those problems predate this PR. |
|
LGTM. I'm going to merge this. The obvious test failures due to method name being returned can probably be fixed with a |
|
Tests are green on master. Thanks for working on this! |
When we attempt to take a sample during a garbage collection, instead of just incrementing a counter, we can include a sample in the
rawoutput corresponding to a fake stake frame indicating that we're in the middle of a garbage collection. In this case, we include a fake name of(garbage collection)and an empty file name.Fixes #86