Ensure VM is running in signal handler #200

ianks · 2023-03-16T05:12:46Z

As of a few weeks ago, there's been an increase of sigabrt in prod for us. After hunting the issue down a bit with @peterzhu2118, we honed in on a case where stackprof is signaled and ruby_current_vm_ptr is null. This seems to be happening after SIGQUIT'ing a unicorn worker during rb_during_gc.

* thread #1, name = 'ruby', stop reason = signal SIGABRT
  * frame #0: 0x00007f10787f100b libc.so.6`raise + 203
    frame #1: 0x00007f10787d0859 libc.so.6`abort + 299
    frame #2: 0x0000562e2c372d4f ruby`rb_bug_for_fatal_signal.cold at error.c:776:5
    frame #3: 0x0000562e2c4d56cd ruby`sigsegv(sig=<unavailable>, info=<unavailable>, ctx=<unavailable>) at signal.c:964:5
    frame #4: 0x00007f10789cf420 libpthread.so.0`__restore_rt
    frame #5: 0x0000562e2c396b0b ruby`rb_during_gc at gc.c:11006:12
    frame #6: 0x00007f1070f45391 stackprof.so`stackprof_signal_handler at stackprof.c:745:34
    frame #7: 0x00007f1070f45382 stackprof.so`stackprof_signal_handler(sig=<unavailable>, sinfo=<unavailable>, ucontext=<unavailable>) at stackprof.c:720
    frame #8: 0x00007f10789cf420 libpthread.so.0`__restore_rt
    frame #9: 0x00007f1066be70f4 libselinux.so.1`___lldb_unnamed_symbol8$$libselinux.so.1 + 36
    frame #10: 0x00007f10790daf6b ld-linux-x86-64.so.2`___lldb_unnamed_symbol59$$ld-linux-x86-64.so.2 + 523
    frame #11: 0x00007f10787f48a7 libc.so.6`___lldb_unnamed_symbol191$$libc.so.6 + 247
    frame #12: 0x00007f10787f4a60 libc.so.6`exit + 32
    frame #13: 0x00007f10787d208a libc.so.6`__libc_start_main + 250
    frame #14: 0x0000562e2c373fde ruby`_start + 46

Since signal safety is hard to reason about, let's add a sanity check and not assume the Ruby VM is active just because the handler is called.

ianks · 2023-03-16T17:20:26Z

Since ruby_current_vm_ptr is not a public symbol, I had to declare it with as an extern. I think the symbol name has changed in various ruby versions, so it would be best to use stable API to do this check.

Is there any other public API that I can use instead to ensure the VM is running? Otherwise, I’ll plunge forward with detecting the various names of this symbol across Ruby versions.

ivoanjo · 2023-03-16T17:33:53Z

Perhaps adding an at_exit { StackProf.stop } would be a possible solution? I've ran into this in #157 as well.

peterzhu2118 · 2023-03-16T17:39:34Z

On Ruby 2.4 and before, it should be called ruby_current_vm rather than ruby_current_vm_ptr. But at any rate, we shouldn't rely on ruby_current_vm_ptr being available for extensions because it isn't marked as public (it isn't wrapped within a RUBY_SYMBOL_EXPORT_BEGIN/RUBY_SYMBOL_EXPORT_END region). AFAIK there isn't a public API to determine if the Ruby VM is available.

ianks · 2023-03-16T17:41:22Z

Perhaps adding an at_exit { StackProf.stop } would be a possible solution? I've ran into this in #157 as well.

That may help, and is probably a good change in its own right. I would still like to add this guard though since it improves the async signal safety of the handler so it's tricky to reason about.

tenderlove · 2023-03-16T21:22:06Z

Perhaps adding an at_exit { StackProf.stop } would be a possible solution? I've ran into this in #157 as well.

I think stackprof should probably do this by default. Either that or its callees should check for existence of the VM.

We could check for the VM in stackprof, but that seems kind of fraught as the global is private which means it can be renamed at any time and the compiler is free to do whatever it wants with the symbol name.

ianks · 2023-03-16T21:52:35Z

We could check for the VM in stackprof, but that seems kind of fraught as the global is private which means it can be renamed at any time and the compiler is free to do whatever it wants with the symbol name.

Yeah I have similar feelings, and very open to other ways of solving this. Maybe a null check in rb_during_gc would be better?

peterzhu2118 · 2023-03-16T23:40:52Z

Maybe a null check in rb_during_gc would be better?

No I don't think that will solve the underlying problem since the code in stackprof that is crashing will then instead crash a few lines down in places like rb_postponed_job_register_one if the VM doesn't exist.

stackprof/ext/stackprof/stackprof.c

Lines 745 to 764 in 52d1df6

    
              if (!_stackprof.ignore_gc && rb_during_gc()) { 
        
           VALUE mode = rb_gc_latest_gc_info(sym_state); 
        
           if (mode == sym_marking) { 
        
               _stackprof.unrecorded_gc_marking_samples++; 
        
           } else if (mode == sym_sweeping) { 
        
               _stackprof.unrecorded_gc_sweeping_samples++; 
        
           } 
        
           _stackprof.unrecorded_gc_samples++; 
        
           rb_postponed_job_register_one(0, stackprof_job_record_gc, (void*)0); 
        
              } else { 
        
                  if (stackprof_use_postponed_job) { 
        
                      rb_postponed_job_register_one(0, stackprof_job_sample_and_record, (void*)0); 
        
                  } else { 
        
                      // Buffer a sample immediately, if an existing sample exists this will 
        
                      // return immediately 
        
                      stackprof_buffer_sample(); 
        
                      // Enqueue a job to record the sample 
        
                      rb_postponed_job_register_one(0, stackprof_job_record_buffer, (void*)0); 
        
                  } 
        
              }

ianks · 2023-03-17T04:46:18Z

Maybe a null check in rb_during_gc would be better?

No I don't think that will solve the underlying problem since the code in stackprof that is crashing will then instead crash a few lines down in places like rb_postponed_job_register_one if the VM doesn't exist.

stackprof/ext/stackprof/stackprof.c

Lines 745 to 764 in 52d1df6

if (!_stackprof.ignore_gc && rb_during_gc()) {

VALUE mode = rb_gc_latest_gc_info(sym_state);

if (mode == sym_marking) {

_stackprof.unrecorded_gc_marking_samples++;

} else if (mode == sym_sweeping) {

_stackprof.unrecorded_gc_sweeping_samples++;

}

_stackprof.unrecorded_gc_samples++;

rb_postponed_job_register_one(0, stackprof_job_record_gc, (void*)0);

} else {

if (stackprof_use_postponed_job) {

rb_postponed_job_register_one(0, stackprof_job_sample_and_record, (void*)0);

} else {

// Buffer a sample immediately, if an existing sample exists this will

// return immediately

stackprof_buffer_sample();

// Enqueue a job to record the sample

rb_postponed_job_register_one(0, stackprof_job_record_buffer, (void*)0);

}

}

I think you are right.

A good stop gap may be to register a new at_exit proc with our own variable (bool ruby_vm_exited) that we can test against, rather than using the VM*

ianks · 2023-03-20T16:00:15Z

@tenderlove @peterzhu2118 Thoughts on my updates?

tenderlove · 2023-03-20T16:01:15Z

@ianks lgtm! Thanks for the patch!

ianks · 2023-03-20T16:03:20Z

FWIW, I think the failing test may be flaky? I noticed it failing on master as well locally (on 3.2).

tenderlove · 2023-03-20T16:13:05Z

FWIW, I think the failing test may be flaky? I noticed it failing on master as well locally (on 3.2).

Ya, it's flaky. I'll merge this and ship it. I think we're trying to measure some GC stuff in the tests and it's not predictable.

ianks force-pushed the ruby-signal-safety branch from 8ebeff1 to b9e762e Compare March 16, 2023 05:15

Check that VM is running in sigaction

455d76a

ianks force-pushed the ruby-signal-safety branch from b9e762e to 455d76a Compare March 20, 2023 15:59

ianks changed the title ~~Guard ruby_current_vm_ptr is non-null in signal handler~~ Ensure VM is running in signal handler Mar 20, 2023

tenderlove merged commit eb1db9f into tmm1:master Mar 20, 2023
10 of 11 checks passed

ianks deleted the ruby-signal-safety branch March 21, 2023 14:11

ianks mentioned this pull request May 3, 2023

Ensure the VM is alive before accessing objspace in C API (Feature #19627) ruby/ruby#7783

Merged

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Ensure VM is running in signal handler #200

Ensure VM is running in signal handler #200

ianks commented Mar 16, 2023

ianks commented Mar 16, 2023

ivoanjo commented Mar 16, 2023

peterzhu2118 commented Mar 16, 2023

ianks commented Mar 16, 2023

tenderlove commented Mar 16, 2023

ianks commented Mar 16, 2023

peterzhu2118 commented Mar 16, 2023

ianks commented Mar 17, 2023

ianks commented Mar 20, 2023

tenderlove commented Mar 20, 2023

ianks commented Mar 20, 2023

tenderlove commented Mar 20, 2023

Ensure VM is running in signal handler #200

Ensure VM is running in signal handler #200

Conversation

ianks commented Mar 16, 2023

ianks commented Mar 16, 2023

ivoanjo commented Mar 16, 2023

peterzhu2118 commented Mar 16, 2023

ianks commented Mar 16, 2023

tenderlove commented Mar 16, 2023

ianks commented Mar 16, 2023

peterzhu2118 commented Mar 16, 2023

ianks commented Mar 17, 2023

ianks commented Mar 20, 2023

tenderlove commented Mar 20, 2023

ianks commented Mar 20, 2023

tenderlove commented Mar 20, 2023