Skip to content

Conversation

@jlfwong
Copy link
Collaborator

@jlfwong jlfwong commented Nov 16, 2017

When we attempt to take a sample during a garbage collection, instead of just incrementing a counter, we can include a sample in the raw output corresponding to a fake stake frame indicating that we're in the middle of a garbage collection. In this case, we include a fake name of (garbage collection) and an empty file name.

Fixes #86

@jlfwong
Copy link
Collaborator Author

jlfwong commented Nov 16, 2017

I'm not totally sure how to test this properly -- the test I updated passes, but there are tests that fail before and after my changes for me.

I'm running bundle exec rake test. My ruby version is:

ruby 2.4.0p0 (2016-12-24 revision 57164) [x86_64-darwin15]

Here are the tests that fail for me:

  1) Failure:
StackProfTest#test_object_allocation [/Users/jlfwong/code/stackprof/test/test_stackprof.rb:44]:
--- expected
+++ actual
@@ -1 +1 @@
-"block in StackProfTest#test_object_allocation"
+"StackProfTest#test_object_allocation"



  2) Failure:
StackProfTest#test_cputime [/Users/jlfwong/code/stackprof/test/test_stackprof.rb:68]:
Expected: "block in StackProfTest#math"
  Actual: "StackProfTest#math"


  3) Failure:
StackProfTest#test_raw [/Users/jlfwong/code/stackprof/test/test_stackprof.rb:108]:
--- expected
+++ actual
@@ -1 +1 @@
-"block (2 levels) in StackProfTest#test_raw"
+"StackProfTest#test_raw"



  4) Failure:
StackProfTest#test_custom [/Users/jlfwong/code/stackprof/test/test_stackprof.rb:93]:
--- expected
+++ actual
@@ -1 +1 @@
-"block (2 levels) in StackProfTest#test_custom"
+"StackProfTest#test_custom"

There are also various ways in which the code seems to crash with my changes:

..........FF...F../Users/jlfwong/code/stackprof/test/test_stackprof.rb:125: [BUG] Illegal instruction at 0x007fffd288ab44

and

.........F.FF.FF/Users/jlfwong/code/stackprof/test/test_stackprof.rb:125: [BUG] rb_gc_mark(): 0x007fed6f9172d8 is T_NONE

#undef S

gc_hook = Data_Wrap_Struct(rb_cObject, stackprof_gc_mark, NULL, &_stackprof);
_stackprof.fake_gc_frame = rb_str_new_literal("fake_gc_frame");
Copy link
Owner

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

You need to mark this as a global. It's getting GC'd otherwise.

rb_hash_aset(details, sym_name, name);
if (frame == _stackprof.fake_gc_frame) {
name = rb_str_new_literal("(garbage collection)");
file = rb_str_new_literal("");
Copy link
Owner

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Might be worth it to create these two strings once upfront and mark them as globals to re-use as well.

@jlfwong
Copy link
Collaborator Author

jlfwong commented Nov 16, 2017

Feedback should be addressed! Thanks for the super fast turnaround time!

gc_hook = Data_Wrap_Struct(rb_cObject, stackprof_gc_mark, NULL, &_stackprof);

_stackprof.fake_gc_frame = rb_str_new_literal("fake_gc_frame");
rb_global_variable(&gc_hook);
Copy link
Owner

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Did this get lost?

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Indeed! Will fix.

Copy link
Owner

@tmm1 tmm1 left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Did that fix your crashes?

@jlfwong
Copy link
Collaborator Author

jlfwong commented Nov 16, 2017

Still getting this on every 6th run or so, even after restoring the accidentally deleted line retaining gc_hook as a global variable.

........FFF.F/Users/jlfwong/code/stackprof/test/test_stackprof.rb:125: [BUG] Illegal instruction at 0x007fffd288ab44

Here's the c backtrace:

0   ruby                                0x000000010d814134 rb_vm_bugreport + 388
1   ruby                                0x000000010d6aed2a rb_bug_context + 490
2   ruby                                0x000000010d78ff38 sigsegv + 72
3   libsystem_platform.dylib            0x00007fffd2887bba _sigtramp + 26
4   stackprof.bundle                    0x000000010dd664d7 stackprof_results + 535
5   ruby                                0x000000010d8075be vm_call_cfunc + 302
6   ruby                                0x000000010d7f1181 vm_exec_core + 10049
7   ruby                                0x000000010d802114 vm_exec + 116
8   ruby                                0x000000010d80f5da invoke_block_from_c_splattable + 362
9   ruby                                0x000000010d7fd1fa rb_yield + 138
10  ruby                                0x000000010d64c9a9 rb_ary_each + 41
11  ruby                                0x000000010d8075be vm_call_cfunc + 302
12  ruby                                0x000000010d7f1181 vm_exec_core + 10049
13  ruby                                0x000000010d802114 vm_exec + 116
14  ruby                                0x000000010d80f5da invoke_block_from_c_splattable + 362
15  ruby                                0x000000010d7fd1fa rb_yield + 138
16  ruby                                0x000000010d65219a rb_ary_collect + 442
17  ruby                                0x000000010d8075be vm_call_cfunc + 302
18  ruby                                0x000000010d7f1181 vm_exec_core + 10049
19  ruby                                0x000000010d802114 vm_exec + 116
20  ruby                                0x000000010d800905 vm_invoke_proc + 293
21  ruby                                0x000000010d73d111 rb_proc_call + 81
22  ruby                                0x000000010d6b6f97 rb_exec_end_proc + 359
23  ruby                                0x000000010d6b74ce ruby_finalize_0 + 110
24  ruby                                0x000000010d6b75f5 ruby_cleanup + 277
25  ruby                                0x000000010d6b789d ruby_run_node + 61
26  ruby                                0x000000010d649faf main + 79

Another run that crashed yielded this stack trace:

0   ruby                                0x0000000101fd2134 rb_vm_bugreport + 388
1   ruby                                0x0000000101e6cd2a rb_bug_context + 490
2   ruby                                0x0000000101f4dee8 sigill + 72
3   libsystem_platform.dylib            0x00007fffd2887bba _sigtramp + 26
4   libsystem_platform.dylib            0x00007fffd288ab44 _os_unfair_lock_recursive_abort + 23

@tmm1
Copy link
Owner

tmm1 commented Nov 16, 2017

Getting the line number in stackprof_results where the crash is occurring would tell you what's going on. You'll want to compile the extension with CFLAGS="-ggdb -O0" and then run under lldb to catch the crash, then bt to see the backtrace with debug info.

@jlfwong
Copy link
Collaborator Author

jlfwong commented Nov 16, 2017

Okay, I got a little bit farther. It took me a while to figure out how to run the tests under lldb, but I got it eventually by using ps aux | grep stackprof while the tests were running via bundle exec rake test. The crashes happen intermittently which makes debugging frustrating.

It looks like malloc is failing somehow?

Here's the backtrace (at least for one of the classes of crashes).

* thread #1, queue = 'com.apple.main-thread', stop reason = EXC_BAD_INSTRUCTION (code=EXC_I386_INVOP, subcode=0x0)
  * frame #0: 0x00007fffd288ab44 libsystem_platform.dylib`_os_unfair_lock_recursive_abort + 23
    frame #1: 0x00007fffd288a465 libsystem_platform.dylib`_os_unfair_lock_lock_slow + 181
    frame #2: 0x00007fffd27fd445 libsystem_malloc.dylib`szone_malloc_should_clear + 179
    frame #3: 0x00007fffd27fd332 libsystem_malloc.dylib`malloc_zone_malloc + 107
    frame #4: 0x00007fffd27fc2b0 libsystem_malloc.dylib`malloc + 24
    frame #5: 0x00000001003b95f8 stackprof.bundle`stackprof_record_sample_for_stack + 88
    frame #6: 0x00007fffd2887bba libsystem_platform.dylib`_sigtramp + 26
    frame #7: 0x00007fffd27ff707 libsystem_malloc.dylib`tiny_free_list_add_ptr + 110
    frame #8: 0x00007fffd28144fc libsystem_malloc.dylib`tiny_free_no_lock + 406
    frame #9: 0x00007fffd28150f3 libsystem_malloc.dylib`free_tiny + 671
    frame #10: 0x000000010007bc1a ruby`ruby_xfree + 58
    frame #11: 0x00000001000885f4 ruby`obj_free + 740
    frame #12: 0x0000000100087d63 ruby`gc_sweep_step + 483
    frame #13: 0x000000010007fc7b ruby`garbage_collect + 139
    frame #14: 0x0000000100084603 ruby`gc_start_internal + 467
    frame #15: 0x00000001001be5be ruby`vm_call_cfunc + 302
    frame #16: 0x00000001001a88dc ruby`vm_exec_core + 11932
    frame #17: 0x00000001001b9114 ruby`vm_exec + 116
    frame #18: 0x00000001001c65da ruby`invoke_block_from_c_splattable + 362
    frame #19: 0x00000001001b414a ruby`rb_yield_1 + 74
    frame #20: 0x00000001000c7c58 ruby`int_dotimes + 216
    frame #21: 0x00000001001be5be ruby`vm_call_cfunc + 302
    frame #22: 0x00000001001a8181 ruby`vm_exec_core + 10049
    frame #23: 0x00000001001b9114 ruby`vm_exec + 116
    frame #24: 0x00000001001c65da ruby`invoke_block_from_c_splattable + 362
    frame #25: 0x000000010006f9e4 ruby`rb_ensure + 164
    frame #26: 0x00000001003b9ed1 stackprof.bundle`stackprof_run + 65
    frame #27: 0x00000001001be5be ruby`vm_call_cfunc + 302
    frame #28: 0x00000001001a8181 ruby`vm_exec_core + 10049
    frame #29: 0x00000001001b9114 ruby`vm_exec + 116
    frame #30: 0x00000001001c65da ruby`invoke_block_from_c_splattable + 362
    frame #31: 0x00000001001b414a ruby`rb_yield_1 + 74
    frame #32: 0x00000001000c7c58 ruby`int_dotimes + 216
    frame #33: 0x00000001001be5be ruby`vm_call_cfunc + 302
    frame #34: 0x00000001001a8181 ruby`vm_exec_core + 10049
    frame #35: 0x00000001001b9114 ruby`vm_exec + 116
    frame #36: 0x00000001001c65da ruby`invoke_block_from_c_splattable + 362
    frame #37: 0x00000001001b41fa ruby`rb_yield + 138
    frame #38: 0x00000001000039a9 ruby`rb_ary_each + 41
    frame #39: 0x00000001001be5be ruby`vm_call_cfunc + 302
    frame #40: 0x00000001001a8181 ruby`vm_exec_core + 10049
    frame #41: 0x00000001001b9114 ruby`vm_exec + 116
    frame #42: 0x00000001001c65da ruby`invoke_block_from_c_splattable + 362
    frame #43: 0x00000001001b41fa ruby`rb_yield + 138
    frame #44: 0x000000010000919a ruby`rb_ary_collect + 442
    frame #45: 0x00000001001be5be ruby`vm_call_cfunc + 302
    frame #46: 0x00000001001a8181 ruby`vm_exec_core + 10049
    frame #47: 0x00000001001b9114 ruby`vm_exec + 116
    frame #48: 0x00000001001b7905 ruby`vm_invoke_proc + 293
    frame #49: 0x00000001000f4111 ruby`rb_proc_call + 81
    frame #50: 0x000000010006df97 ruby`rb_exec_end_proc + 359
    frame #51: 0x000000010006e4ce ruby`ruby_finalize_0 + 110
    frame #52: 0x000000010006e5f5 ruby`ruby_cleanup + 277
    frame #53: 0x000000010006e89d ruby`ruby_run_node + 61
    frame #54: 0x0000000100000faf ruby`main + 79
    frame #55: 0x00007fffd267b255 libdyld.dylib`start + 1

I'm also not sure I set the debug flags correctly. To do that part, I made the following change:

diff --git i/ext/stackprof/extconf.rb w/ext/stackprof/extconf.rb
index 7d1a52c..ad29963 100644
--- i/ext/stackprof/extconf.rb
+++ w/ext/stackprof/extconf.rb
@@ -3,7 +3,10 @@ if have_func('rb_postponed_job_register_one') &&
    have_func('rb_profile_frames') &&
    have_func('rb_tracepoint_new') &&
    have_const('RUBY_INTERNAL_EVENT_NEWOBJ')
-  create_makefile('stackprof/stackprof')
+
+   with_cflags('-ggdb -O0') do
+     create_makefile('stackprof/stackprof')
+   end
 else
   fail 'missing API: are you using ruby 2.1+?'
 end

@jlfwong
Copy link
Collaborator Author

jlfwong commented Nov 16, 2017

It looks like there are at least 2 classes of failure. Here's a different one:

.Process 8145 stopped
* thread #1, queue = 'com.apple.main-thread', stop reason = EXC_BAD_ACCESS (code=1, address=0x107e5a000)
    frame #0: 0x00000001003ba4d7 stackprof.bundle`stackprof_results + 535
stackprof.bundle`stackprof_results:
->  0x1003ba4d7 <+535>: movq   (%rax,%r12), %rdi
    0x1003ba4db <+539>: addq   $0x8, %r12
    0x1003ba4df <+543>: decq   %r13
    0x1003ba4e2 <+546>: jne    0x1003ba4c0               ; <+512>
(lldb) bt
* thread #1, queue = 'com.apple.main-thread', stop reason = EXC_BAD_ACCESS (code=1, address=0x107e5a000)
  * frame #0: 0x00000001003ba4d7 stackprof.bundle`stackprof_results + 535
    frame #1: 0x00000001001be5be ruby`vm_call_cfunc + 302
    frame #2: 0x00000001001a8181 ruby`vm_exec_core + 10049
    frame #3: 0x00000001001b9114 ruby`vm_exec + 116
    frame #4: 0x00000001001c65da ruby`invoke_block_from_c_splattable + 362
    frame #5: 0x00000001001b414a ruby`rb_yield_1 + 74
    frame #6: 0x00000001000c7c58 ruby`int_dotimes + 216
    frame #7: 0x00000001001be5be ruby`vm_call_cfunc + 302
    frame #8: 0x00000001001a8181 ruby`vm_exec_core + 10049
    frame #9: 0x00000001001b9114 ruby`vm_exec + 116
    frame #10: 0x00000001001c65da ruby`invoke_block_from_c_splattable + 362
    frame #11: 0x00000001001b41fa ruby`rb_yield + 138
    frame #12: 0x00000001000039a9 ruby`rb_ary_each + 41
    frame #13: 0x00000001001be5be ruby`vm_call_cfunc + 302
    frame #14: 0x00000001001a8181 ruby`vm_exec_core + 10049
    frame #15: 0x00000001001b9114 ruby`vm_exec + 116
    frame #16: 0x00000001001c65da ruby`invoke_block_from_c_splattable + 362
    frame #17: 0x00000001001b41fa ruby`rb_yield + 138
    frame #18: 0x000000010000919a ruby`rb_ary_collect + 442
    frame #19: 0x00000001001be5be ruby`vm_call_cfunc + 302
    frame #20: 0x00000001001a8181 ruby`vm_exec_core + 10049
    frame #21: 0x00000001001b9114 ruby`vm_exec + 116
    frame #22: 0x00000001001b7905 ruby`vm_invoke_proc + 293
    frame #23: 0x00000001000f4111 ruby`rb_proc_call + 81
    frame #24: 0x000000010006df97 ruby`rb_exec_end_proc + 359
    frame #25: 0x000000010006e4ce ruby`ruby_finalize_0 + 110
    frame #26: 0x000000010006e5f5 ruby`ruby_cleanup + 277
    frame #27: 0x000000010006e89d ruby`ruby_run_node + 61
    frame #28: 0x0000000100000faf ruby`main + 79
    frame #29: 0x00007fffd267b255 libdyld.dylib`start + 1

@tmm1
Copy link
Owner

tmm1 commented Nov 16, 2017

That looks like classic signs of memory corruption, because some ruby object is not getting marked correctly.

}

if (_stackprof.raw_samples_capa <= _stackprof.raw_samples_len + num) {
while (_stackprof.raw_samples_capa <= _stackprof.raw_samples_len + (num + 2)) {
Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Okay! I think this fixes it.

I believe this code was previously unsafe.

This is checking to see if the capacity of raw_samples is sufficient to cover num new entries, but it really needs to cover up to num + 2 new entries, since each sample entry is

stack height, followed by n frame ids, followed by the sample count.

I think this was an issue even before this change, but I'm not able to reproduce a crash before this PR.

Regardless, I think this is safer now?

Copy link
Owner

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'm not sure I follow.. why would this need to be a loop?

Copy link
Owner

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I see, because we just double every time.

Calling realloc multiple times seems silly. It should calculate the required size once, and then call realloc once.

Copy link
Collaborator Author

@jlfwong jlfwong Nov 16, 2017

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

In practice, I suspect this will almost never need to call realloc more than once, because we set the initial buffer size to be 100 * the number of frames in the first sample, and for this to be called more than once, a given stack would have to be more than the size of all previous stacks combined. The while loop is just for safety, but I suspect in practice it'll never be a significant performance burden.

Copy link
Owner

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Okay fair enough.

@tmm1
Copy link
Owner

tmm1 commented Nov 16, 2017

I see some random failures locally like:

  2) Failure:
StackProfTest#test_gc [/Users/tmm1/code/stackprof/test/test_stackprof.rb:132]:
Expected: 0
  Actual: 112

These should probably be attributed to the GC instead of counting as missed.

rb_postponed_job_register_one(0, stackprof_job_handler, 0);
if (rb_during_gc()) {
_stackprof.during_gc++;
rb_postponed_job_register_one(0, stackprof_job_handler, (void*)1);
Copy link
Owner

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Good catch. The list of signal-safe functions is very small, so calling into rb_hash_aset etc from the signal handler was not going to work reliably.

@jlfwong
Copy link
Collaborator Author

jlfwong commented Nov 16, 2017

These should probably be attributed to the GC instead of counting as missed.

I guess it depends on what the miss count is meant to represent.
The miss count can be high there if the gc samples didn't have time to record.

It seems like you'd want to have the gc sample count to accurately reflect the number of sample in raw that correspond to garbage collection, and we can miss recording a sample if the next timer interrupt fires before the deferred jobs are run.

@jlfwong
Copy link
Collaborator Author

jlfwong commented Nov 16, 2017

I'm happy to make the sample counts correspond however you'd like -- just need to understand what gc_samples and missed_samples is meant to represent. Before this PR, it was pretty clear, but after this PR, the meaning of gc_samples seems a little murkier.

@tmm1
Copy link
Owner

tmm1 commented Nov 16, 2017

rb_postponed_job_register_one will not queue up more than one job for the same function callback. So when the signal fires multiple times during a GC, the job_handler fires only once and adds a single GC frame. This means the extra signals are considered "missed", even though we technically know they all happened during GC.

I think it makes sense to use distinct job handlers for gc and regular callbacks, so they don't cancel each other out.

And it seems we should use another counter like "current_gc_samples" that keeps track of how much time the current GC iteration took, then use that when the job finally fires and reset it at the end. Does that make sense?

@jlfwong
Copy link
Collaborator Author

jlfwong commented Nov 16, 2017

Yep, that makes sense! Thanks for the clear explanation

@tmm1
Copy link
Owner

tmm1 commented Nov 17, 2017

Diff looks great. Should I merge it?

@jlfwong
Copy link
Collaborator Author

jlfwong commented Nov 17, 2017

Well it looks like CI is failing? Do I need to address the failures there?

Seems like the code also yields some C90 warnings?

@jlfwong
Copy link
Collaborator Author

jlfwong commented Nov 17, 2017

Looks like rb_str_new_literal doesn't exist in Ruby 2.1

@tmm1
Copy link
Owner

tmm1 commented Nov 17, 2017

That's probably worth fixing.

I just turned the CI on so I don't think anything other than 2.2 was passing before either.

@tmm1
Copy link
Owner

tmm1 commented Nov 17, 2017

Looks like rb_str_new_literal doesn't exist in Ruby 2.1

You can detect if its available via extconf, and then add a wrapper using ifndef

@jlfwong
Copy link
Collaborator Author

jlfwong commented Nov 17, 2017

Well, that seems to have fixed the compilation failures. There are still test failures I'm not too sure what to do about. Some of them look like they're just differences in implementation, and some of them look like they probably randomly pass or succeed.

Is there anything you'd like me to do to address that? I think some of those problems predate this PR.

@tmm1
Copy link
Owner

tmm1 commented Nov 17, 2017

LGTM. I'm going to merge this. The obvious test failures due to method name being returned can probably be fixed with a RUBY_VERSION check in the test. But that can happen in a separate PR.

@tmm1 tmm1 merged commit 1352832 into tmm1:master Nov 17, 2017
@tmm1
Copy link
Owner

tmm1 commented Nov 17, 2017

Tests are green on master. Thanks for working on this!

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

2 participants