
roscpp: crash in ros::TransportPublisherLink::onRetryTimer() #577

Closed
stwirth opened this issue Mar 10, 2015 · 25 comments

@stwirth

stwirth commented Mar 10, 2015

We witnessed a crash of diagnostic_aggregator/aggregator_node on one of our robots; this is what I could find:

Reading symbols from /opt/ros/indigo/lib/diagnostic_aggregator/aggregator_node...(no debugging symbols found)...done.
[New LWP 2295]
[New LWP 2289]
[New LWP 6780]
[New LWP 2284]
[New LWP 2288]
[New LWP 2290]
[Thread debugging using libthread_db enabled]
Using host libthread_db library "/lib/x86_64-linux-gnu/libthread_db.so.1".
Core was generated by `/opt/ros/indigo/lib/diagnostic_aggregator/aggregator_node __name:=diagnostic_ag'.
Program terminated with signal SIGSEGV, Segmentation fault.
#0  0x00007f59196794ef in ros::TransportPublisherLink::onRetryTimer(ros::WallTimerEvent const&) () from /opt/ros/indigo/lib/libroscpp.so
(gdb) bt
#0  0x00007f59196794ef in ros::TransportPublisherLink::onRetryTimer(ros::WallTimerEvent const&) () from /opt/ros/indigo/lib/libroscpp.so
#1  0x00007f591962cb00 in ros::TimerManager<ros::WallTime, ros::WallDuration, ros::WallTimerEvent>::TimerQueueCallback::call() ()
   from /opt/ros/indigo/lib/libroscpp.so
#2  0x00007f59196554f7 in ros::CallbackQueue::callOneCB(ros::CallbackQueue::TLS*) () from /opt/ros/indigo/lib/libroscpp.so
#3  0x00007f5919656303 in ros::CallbackQueue::callAvailable(ros::WallDuration) () from /opt/ros/indigo/lib/libroscpp.so
#4  0x00007f59196825c4 in ros::internalCallbackQueueThreadFunc() () from /opt/ros/indigo/lib/libroscpp.so
#5  0x00007f5917bcba4a in ?? () from /usr/lib/x86_64-linux-gnu/libboost_thread.so.1.54.0
#6  0x00007f59179aa182 in start_thread (arg=0x7f590bfff700) at pthread_create.c:312
#7  0x00007f59184f947d in clone () at ../sysdeps/unix/sysv/linux/x86_64/clone.S:111

Unfortunately that happened only once on one robot. No way to reproduce it. Has anybody seen something like that or can give me a hint on how to debug this further?

@rethink-imcmahon

+1 on this issue. I'm investigating a core with a very similar stack trace. It's incredibly difficult to reproduce, but I've managed to get two instances of the dump now:

Program terminated with signal 11, Segmentation fault.
(gdb) bt full
#0  atomic_conditional_increment (pw=0x3241) at /usr/include/boost/smart_ptr/detail/sp_counted_base_gcc_x86.hpp:92
        rv = <optimized out>
#1  add_ref_lock (this=0x3239) at /usr/include/boost/smart_ptr/detail/sp_counted_base_gcc_x86.hpp:139
No locals.
#2  shared_count (r=..., this=0x7f37ab7fdb18) at /usr/include/boost/smart_ptr/detail/shared_count.hpp:589
No locals.
#3  boost::shared_ptr<ros::Subscription>::shared_ptr<ros::Subscription> (this=0x7f37ab7fdb10, r=...) at /usr/include/boost/smart_ptr/shared_ptr.hpp:414
No locals.
#4  0x00007f37b900b7b0 in lock (this=0x7f376c0275b8) at /usr/include/boost/smart_ptr/weak_ptr.hpp:178
No locals.
#5  ros::TransportPublisherLink::onRetryTimer (this=0x7f376c0275a0) at /var/tmp/portage/sci-ros/ros-indigo-roscpp-1.11.8.0-r2/work/ros-indigo-roscpp-1.11.8.0/src/libros/transport_publisher_link.cpp:204
        parent = {
          px = 0x0, 
          pn = {
            pi_ = 0x3239
          }
        }
        __PRETTY_FUNCTION__ = "void ros::TransportPublisherLink::onRetryTimer(const ros::WallTimerEvent&)"
#6  0x00007f37b8fc1e56 in operator() (a0=..., this=<optimized out>) at /usr/include/boost/function/function_template.hpp:767
No locals.
#7  ros::TimerManager<ros::WallTime, ros::WallDuration, ros::WallTimerEvent>::TimerQueueCallback::call (this=0x7f371c000eb0)
    at /var/tmp/portage/sci-ros/ros-indigo-roscpp-1.11.8.0-r2/work/ros-indigo-roscpp-1.11.8.0/include/ros/timer_manager.h:184
        tracked = {
          px = 0x0, 
          pn = {
            pi_ = 0x0
          }
        }
        event = {
          last_expected = {
            <ros::TimeBase<ros::WallTime, ros::WallDuration>> = {
              sec = 1430265753, 
              nsec = 727748844
            }, <No data fields>}, 
          last_real = {
            <ros::TimeBase<ros::WallTime, ros::WallDuration>> = {
              sec = 1430265753, 
              nsec = 727984536
            }, <No data fields>}, 
          current_expected = {
            <ros::TimeBase<ros::WallTime, ros::WallDuration>> = {
              sec = 1430265753, 
              nsec = 827748844
            }, <No data fields>}, 
          current_real = {
            <ros::TimeBase<ros::WallTime, ros::WallDuration>> = {
              sec = 1430265753, 
              nsec = 828038627
            }, <No data fields>}, 
          profile = {
            last_duration = {
              <ros::DurationBase<ros::WallDuration>> = {
                sec = 0, 
                nsec = 424
              }, <No data fields>}
          }
        }
#8  0x00007f37b8fea4ef in ros::CallbackQueue::callOneCB (this=this@entry=0x24eb900, tls=tls@entry=0x7f37a40008c0)
    at /var/tmp/portage/sci-ros/ros-indigo-roscpp-1.11.8.0-r2/work/ros-indigo-roscpp-1.11.8.0/src/libros/callback_queue.cpp:382
        rw_lock = {
          m = 0x7f371c0014f8, 
          is_locked = true
        }
        last_calling = 18446744073709551615
        result = ros::CallbackInterface::Invalid
        info = {
          callback = {
            px = 0x7f371c000eb0, 
            pn = {
              pi_ = 0x7f371c000910
            }
          }, 
          removal_id = 139877086151072, 
          marked_for_removal = false
        }
        cb = @0x7f37ab7fdd00: {
          px = 0x7f371c000eb0, 
          pn = {
            pi_ = 0x7f371c000910
          }
        }
        id_info = {
          px = 0x7f371c0014f0, 
          pn = {
            pi_ = 0x7f371c000a80
          }
        }
#9  0x00007f37b8fea843 in ros::CallbackQueue::callAvailable (this=0x24eb900, timeout=...)
    at /var/tmp/portage/sci-ros/ros-indigo-roscpp-1.11.8.0-r2/work/ros-indigo-roscpp-1.11.8.0/src/libros/callback_queue.cpp:333
        tls = <optimized out>
        called = 0
#10 0x00007f37b90162cc in ros::internalCallbackQueueThreadFunc () at /var/tmp/portage/sci-ros/ros-indigo-roscpp-1.11.8.0-r2/work/ros-indigo-roscpp-1.11.8.0/src/libros/init.cpp:282
        queue = {
          px = 0x24eb900, 
          pn = {
            pi_ = 0x24e7cd0
          }
        }
#11 0x00007f37bb4b594d in boost::(anonymous namespace)::thread_proxy (param=<optimized out>) at libs/thread/src/pthread/thread.cpp:164
        thread_info = {
          px = 0x24eeac0, 
          pn = {
            pi_ = 0x24eb860
          }
        }
#12 0x00007f37b8ce41da in start_thread () from /lib64/libpthread.so.0
No symbol table info available.
#13 0x00007f37b6e11d7d in clone () from /lib64/libc.so.6
No symbol table info available.

The only useful piece of information that I can provide in reproducing the dump is that it happens (sometimes) as I kill a node that published / subscribed to the same topics that the dumping node subscribed / published to respectively.

Poking around in frame 8 (died at line 382 of /src/libros/callback_queue.cpp), I see that the callback in the TLS is not marked_for_removal, which causes cb->call() to be executed on a subscription that has already been destroyed:

  362       CallbackInfo info = *tls->cb_it;
   ...
  375           if (info.marked_for_removal)
  376           {
  377             tls->cb_it = tls->callbacks.erase(tls->cb_it);
  378           }
  379           else
  380           {
  381             tls->cb_it = tls->callbacks.erase(tls->cb_it);
> 382             result = cb->call();
  383           }

Should this callback be marked for removal, and if so, how?

(gdb) print info
$50 = {
  callback = {
    px = 0x7f371c000eb0, 
    pn = {
      pi_ = 0x7f371c000910
    }
  }, 
  removal_id = 139877086151072, 
  marked_for_removal = false
}

@rethink-imcmahon

@dirk-thomas & @wjwwood, have you seen any instances of this segfault at OSRF? I now have several instances from various ROS nodes, mostly on disconnect, but the core dumps are frustratingly hard to debug past a certain point (all with traces similar to the one shown above).

I'm trying to distill this crash down into a simple, reproducible example, so any thoughts or tips would be greatly appreciated.

@wjwwood
Member

wjwwood commented May 6, 2015

@rethink-imcmahon I have not. I'll try to spend some time today or tomorrow looking at it.

@rethink-imcmahon

OK, I still don't have a nice reproducible example, but I think I have uncovered the issue.
The dumps always occur in this block of code: https://github.com/ros/ros_comm/blob/indigo-devel/clients/roscpp/src/libros/callback_queue.cpp#L362-L383

Inside CallbackQueue::callOneCB() there is a very important info variable:

362 | CallbackInfo info = *tls->cb_it;
...
368 |    boost::shared_lock<boost::shared_mutex> rw_lock(id_info->calling_rw_mutex);
...
375 |     if (info.marked_for_removal)
376 |     {
377 |        tls->cb_it = tls->callbacks.erase(tls->cb_it);
378 |     }
379 |     else
380 |     {
381 |        tls->cb_it = tls->callbacks.erase(tls->cb_it);
382 |        result = cb->call();
383 |     }

What seems odd here is that we are not referencing the original *tls->cb_it object, but making a copy of it! To make the reference properly, we need to add an &:

362 | CallbackInfo& info = *tls->cb_it;

Furthermore, if you look at another very similar block of code in the same file, you can see the reference is taken properly for a CallbackInfo object:
https://github.com/ros/ros_comm/blob/indigo-devel/clients/roscpp/src/libros/callback_queue.cpp#L248-L261

248 |      CallbackInfo& info = *it;
249 |
250 |      if (info.marked_for_removal)
251 |      {
252 |        it = callbacks_.erase(it);
253 |        continue;
254 |      }

Copy vs reference? It really shouldn't matter, should it? Well, in the L362-383 block it does, since there is a shared_mutex lock in between the copy and the if (info.marked_for_removal). If the conditions are just right, by the time this thread acquires the lock the info copy is stale: its marked_for_removal is still false, even though the original object's marked_for_removal field has since been set to true. That causes result = cb->call() to execute on an invalid callback and coredump.
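To make the staleness concrete, here is a minimal standalone illustration (not roscpp code; the time spent blocked on calling_rw_mutex is modelled by a plain assignment in between):

#include <iostream>

// Stand-in for the real CallbackInfo; only the flag matters for this sketch.
struct CallbackInfo { bool marked_for_removal; };

int main()
{
  CallbackInfo original = { false };

  CallbackInfo copy = original;   // what line 362 currently does
  CallbackInfo& ref = original;   // what the proposed patch does

  // While the calling thread would be blocked acquiring calling_rw_mutex,
  // another thread (e.g. removeByID) marks the callback for removal:
  original.marked_for_removal = true;

  std::cout << "copy sees marked_for_removal = " << copy.marked_for_removal << std::endl;  // 0: stale, callback would run
  std::cout << "ref sees marked_for_removal = " << ref.marked_for_removal << std::endl;    // 1: callback would be skipped
  return 0;
}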

I'm wide open for critiques of this theory & ideas on a reliable method for reproducing this crash...

@rethink-imcmahon

ros_comm built fine with the changes that I proposed. Perhaps I'm just lucky, but I managed to get two coredumps in a row with modified talker / subscriber roscpp_tutorial nodes. Trying to reproduce the issue (and gain more insight into the original *tls->cb_it object), I added a couple of copies to bring them into the local scope, then rebuilt. Now the nodes are spinning up and down continuously every few seconds, so I'll post if I get another useful coredump.

@dirk-thomas
Member

I have created PR #615 to add the missing reference. Even if it is unclear whether it helps solve this problem, not copying the callback info at that point is the "right" thing to do. (Update: as trivial as the patch looks, it doesn't seem to pass all unit tests.)

Can you please share some information about the scenario in which you get the crashes with the roscpp_tutorial talker / listener? Then others might be able to reproduce it the same way.

@dirk-thomas
Member

I want to summarize my thoughts from reading through the code - maybe it helps others trace the flow faster.

The callback iterators in the thread-local storage (TLS) are the ones that have already been popped off the queue to be called within a specific thread, without requiring locks on the queue itself.

There are four lines which use marked_for_removal besides the constructor of the info object:

The second one checks the flag for callbacks in the queue and therefore does so with a locked mutex_.

The third and fourth check the flag for a callback in the TLS and therefore do so while holding a lock on the calling_rw_mutex. The invocations of callOneCB happen from callAvailable, which picks callbacks from the queue and puts them into the thread-local storage (while holding mutex_). The invocation of each callback in the TLS is then done mutex-free (which is the idea of the TLS).

The first one actually marks all pending callbacks in the TLS for removal. The surrounding removeByID method holds no locks while setting these flags. This method is invoked from several places whenever callbacks are being removed / deregistered. This could happen asynchronously from a different thread as well as within another callback.

But since the TLS contains only thread-local callbacks, the block changing the flags

{
  D_CallbackInfo::iterator it = tls_->callbacks.begin();
  D_CallbackInfo::iterator end = tls_->callbacks.end();
  for (; it != end; ++it)
  {
    CallbackInfo& info = *it;
    if (info.removal_id == removal_id)
    {
      info.marked_for_removal = true;
    }
  }
}

can only occur in case of an invocation from within another callback. Otherwise the TLS should not contain anything. Therefore I think it is perfectly safe to do this without additional locking.
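As a rough sketch of that queue-vs-TLS split (illustrative only, not roscpp code; a per-call local list stands in for the real thread-local storage):

#include <boost/thread/mutex.hpp>
#include <boost/function.hpp>
#include <iostream>
#include <list>

// Shared queue protected by mutex_ (names are made up for this sketch).
boost::mutex mutex_;
std::list<boost::function<void()> > callbacks_;

void someCallback() { std::cout << "invoked without the queue lock held" << std::endl; }

void callAvailable()
{
  // Move the pending callbacks into a list that only this thread touches...
  std::list<boost::function<void()> > tls_callbacks;
  {
    boost::mutex::scoped_lock lock(mutex_);
    tls_callbacks.swap(callbacks_);
  }
  // ...so they can be invoked without holding any lock on the queue.
  for (std::list<boost::function<void()> >::iterator it = tls_callbacks.begin();
       it != tls_callbacks.end(); ++it)
  {
    (*it)();
  }
}

int main()
{
  {
    boost::mutex::scoped_lock lock(mutex_);
    callbacks_.push_back(&someCallback);
  }
  callAvailable();
  return 0;
}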

@rethink-imcmahon

Ok, thank you for the clarifications Dirk. I'm still trying to wrap my head around some of it, but the explanation is certainly helpful. Just brainstorming ways to debug this, I was thinking of adding an artificial wait around the lock to simulate a starved thread. I'll let you know if this line of thought turns up anything.

As for the talker / listener, I'll commit them to my fork and put a link here. They did segfault overnight but in another piece of initialization code. I'll look into that later, but for now this is still my focus.

@rethink-imcmahon

What are the unit tests checking for exactly?

@dirk-thomas
Member

Are you referring to the unit tests failing for #615? I don't know what they are checking exactly. I will have to look more closely into them, but it seems they are failing due to the added reference.

@rethink-imcmahon

@dirk-thomas or @stwirth, do you have any updates on this bug? I'd be curious to know why the #615 unit tests were failing, and how we can fix up this patch.

@stwirth
Author

stwirth commented Jun 24, 2015

@rethink-imcmahon I don't have updates. The bug never appeared again, so I haven't spent time on it.

@dirk-thomas
Member

Also no updates from me. I haven't spent any more time on this since the last comment. I don't expect to have time to look further into it, and therefore I tried to summarize my investigation as well as possible to enable others to continue looking into it.

@jasonimercer
Contributor

I've observed this crash but I don't have anything meaningful to add:

Core was generated by `/home/administrator/robot_ws/devel/lib/move_base/move_base map:=slam/map cmd_ve'.
Program terminated with signal SIGSEGV, Segmentation fault.
#0  0x00007ff4450347a3 in ros::TransportPublisherLink::onRetryTimer(ros::WallTimerEvent const&) () from /opt/ros/indigo/lib/libroscpp.so
(gdb) bt
#0  0x00007ff4450347a3 in ros::TransportPublisherLink::onRetryTimer(ros::WallTimerEvent const&) () from /opt/ros/indigo/lib/libroscpp.so
#1  0x00007ff444fe7b30 in ros::TimerManager<ros::WallTime, ros::WallDuration, ros::WallTimerEvent>::TimerQueueCallback::call() () from /opt/ros/indigo/lib/libroscpp.so
#2  0x00007ff445010777 in ros::CallbackQueue::callOneCB(ros::CallbackQueue::TLS*) () from /opt/ros/indigo/lib/libroscpp.so
#3  0x00007ff445011583 in ros::CallbackQueue::callAvailable(ros::WallDuration) () from /opt/ros/indigo/lib/libroscpp.so
#4  0x00007ff44503d8f4 in ros::internalCallbackQueueThreadFunc() () from /opt/ros/indigo/lib/libroscpp.so
#5  0x00007ff442acfa4a in ?? () from /usr/lib/x86_64-linux-gnu/libboost_thread.so.1.54.0
#6  0x00007ff4428ae182 in start_thread (arg=0x7ff43b37e700) at pthread_create.c:312
#7  0x00007ff44433a47d in clone () at ../sysdeps/unix/sysv/linux/x86_64/clone.S:111

@rethink-imcmahon

We have many virtual machines running a wide variety of ROS tasks nightly, so this issue is causing a handful of coredumps every night. In trying to see why the fix I proposed in May didn't work, I added the "reference-instead-of-copy" patch locally. Eventually, I managed to make it core dump:

(gdb) bt full
#0  0x00007f0e3c5afebc in ros::CallbackQueue::callOneCB (this=0x2129460, tls=0x7f0e280008c0)
    at /data/users/imcmahon/roscpp_ws/src/ros-indigo-roscpp-1.11.8/src/libros/callback_queue.cpp:382
        rw_lock = {m = 0x7f0ddc000938, is_locked = true}
        last_calling = 18446744073709551615
        result = ros::CallbackInterface::Invalid
        __PRETTY_FUNCTION__ = "ros::CallbackQueue::CallOneResult ros::CallbackQueue::callOneCB(ros::CallbackQueue::TLS*)"
        info = @0x7f0e28001910: {callback = {px = 0x7f0ddc0008c0, pn = {pi_ = 0x7f0ddc000910}}, 
          removal_id = 139698911842016, marked_for_removal = false}
        cb = @0x7f0e28001910: {px = 0x7f0ddc0008c0, pn = {pi_ = 0x7f0ddc000910}}
        id_info = {px = 0x7f0ddc000930, pn = {pi_ = 0x7f0ddc000a80}}
#1  0x00007f0e3c5af921 in ros::CallbackQueue::callAvailable (this=0x2129460, timeout=...)
    at /data/users/imcmahon/roscpp_ws/src/ros-indigo-roscpp-1.11.8/src/libros/callback_queue.cpp:333
        tls = 0x7f0e280008c0
        called = 0
#2  0x00007f0e3c5f5160 in ros::internalCallbackQueueThreadFunc ()
    at /data/users/imcmahon/roscpp_ws/src/ros-indigo-roscpp-1.11.8/src/libros/init.cpp:282
        queue = {px = 0x2129460, pn = {pi_ = 0x21289c0}}
#3  0x00007f0e3c6006c9 in boost::detail::thread_data<void (*)()>::run (this=0x2129cb0)
    at /usr/include/boost/thread/detail/thread.hpp:117
No locals.
#4  0x00007f0e3ab79a4a in ?? () from /usr/lib/x86_64-linux-gnu/libboost_thread.so.1.54.0
No symbol table info available.
#5  0x00007f0e3a958182 in start_thread (arg=0x7f0e34b86700) at pthread_create.c:312
        __res = <optimized out>
        pd = 0x7f0e34b86700
        now = <optimized out>
        unwind_buf = {cancel_jmp_buf = {{jmp_buf = {139698990769920, -5890305797539832903, 0, 0, 139698990770624, 
                139698990769920, 5790792828027591609, 5790814652555123641}, mask_was_saved = 0}}, priv = {pad = {0x0, 
              0x0, 0x0, 0x0}, data = {prev = 0x0, cleanup = 0x0, canceltype = 0}}}
        not_first_call = <optimized out>
        pagesize_m1 = <optimized out>
        sp = <optimized out>
        freesize = <optimized out>
        __PRETTY_FUNCTION__ = "start_thread"
#6  0x00007f0e3b6f347d in clone () at ../sysdeps/unix/sysv/linux/x86_64/clone.S:111
(gdb) frame 0
#0  0x00007f0e3c5afebc in ros::CallbackQueue::callOneCB (this=0x2129460, tls=0x7f0e280008c0)
    at /data/users/imcmahon/roscpp_ws/src/ros-indigo-roscpp-1.11.8/src/libros/callback_queue.cpp:382
382         result = cb->call();
(gdb) print cb
$1 = (ros::CallbackInterfacePtr &) @0x7f0e28001910: {px = 0x7f0ddc0008c0, pn = {pi_ = 0x7f0ddc000910}}

This coredump was quite hard to produce, but after hundreds of subscribers were spun up and SIGINT'd, I managed to get what you see here. So it seems that before, when we were copying CallbackInfo rather than grabbing a reference to it, we were seeing partial execution of the callback (onRetryTimer). Now, it appears the underlying result = cb->call() itself is no longer valid. I think this has to do with the pointer to the subscriber being destroyed, causing the dereference to fail.
Before this patch, when callback_queue held a copy of CallbackInfo info, we could last a little longer and end up in more dangerous territory by trying to utilize a destroyed weak_ptr. According to the documentation, weak_ptr operations never throw exceptions, so roscpp must have been doing something wrong to make the weak_ptr blow up down in the assembly code, as we see in the original stack trace. I looked at some plausible causes for this: threads being destroyed while references to their callbacks still exist, or possibly this - the Subscription's vector of publisher links being swapped out during connection dropping, leaving an empty vector in its place:

void Subscription::dropAllConnections()
{
  // Swap our subscribers list with a local one so we can only lock for a short period of time, because a
  // side effect of our calling drop() on connections can be re-locking the subscribers mutex
  V_PublisherLink localsubscribers;
  {
    boost::mutex::scoped_lock lock(publisher_links_mutex_);
    localsubscribers.swap(publisher_links_);
  }

  V_PublisherLink::iterator it = localsubscribers.begin();
  V_PublisherLink::iterator end = localsubscribers.end();
  for (; it != end; ++it)
  {
    (*it)->drop();
  }
}

Despite the new core dump, I think I still prefer having the reference for CallbackInfo rather than the direct copy, since having the copy led us into a weird assembly code state, and the reference gives us a null cb->call() faster (which can be handled appropriately). I do feel like I am dancing around the underlying issue, so if anyone can give a more satisfying explanation of these errors, I'd be happy to hear it.

@rethink-imcmahon

The head of our QA department sent me some stats on this bug today. We run an extensive suite of tests on a nightly basis with many virtual robots in the cloud. In just the virtual robots (not counting the instances of this bug that we find on our real hardware), he found the following over the past month:

Total count of core dumps since 7/19/15: 56
Total count of test runs: 59
Total count of test suites per run: 40
On average, each suite takes 36 min.

So we saw 56 instances of the roscpp bug across a total of 2360 newly cleaned robots, each of them running heavily for 36 min.

It seems every ROS subscriber is susceptible to the crash, and the nodes that create and destroy subscriptions regularly are the worst offenders. I am still actively hunting this bug down, but I would welcome any help from others in the community. @wjwwood @dirk-thomas @jacquelinekay any thoughts on how to raise awareness of this issue in the community?

@wjwwood
Member

wjwwood commented Aug 20, 2015

@rethink-imcmahon I'd like to have a crack at it, but realistically I'm going to be too busy for a few weeks. Maybe you can just send an email to ros-users@ explaining that several people have taken some time to look at the problem but it eludes us. Someone might take it as a challenge and find a problem that we have overlooked.

@jacquelinekay

@rethink-imcmahon here are a few ideas for creating a test to reliably reproduce this crash (since it sounds like it happens randomly):

The fact that it's difficult to reproduce but happens more frequently when there are many subscribers makes me wonder if it happens when there's a scheduling anomaly. Can you try adding a sleep right before line 382 of callback_queue.cpp?

Or, to confirm or deny your theory about the subscriptions dropping, you could make a test where one thread spins doing some pub/sub work, and another thread causes all subscriptions to drop.

@jeffadamsc

I made a little progress on this today. @jacquelinekay, thanks for the tip. I added the line "usleep(100);" before line 382 of callback_queue.cpp and that did help to reproduce the bug more often. I also modified a simple test by @rethink-imcmahon as follows and I can reproduce the bug about every minute:


#include "ros/ros.h"
#include "std_msgs/String.h"
#include <sstream>
#include <boost/thread.hpp>


void chatterCallback(const std_msgs::String::ConstPtr& msg)
{
  ROS_INFO("I heard: [%s]", msg->data.c_str());
}

void subscribeDestroy() {
  ros::Time start_time;
  double rollover_time = 5.0; // seconds
  size_t num_subs = 100;
  std_msgs::String msg;
  msg.data = std::string("hello world");
  ros::NodeHandle n;
  while(ros::ok())
  {
    start_time = ros::Time::now();
    std::vector<ros::Subscriber> subs;
    for( size_t i = 0; i < num_subs; i++ )
    {
      subs.push_back(n.subscribe("chatter", 100, chatterCallback));
    }
    usleep(500000);
    // subscribers destructors called here
    std::cout<<"Rolling over! Calling dtors now. "<<"Total time: "<<(ros::Time::now() - start_time).toSec()<<std::endl;
  }
}

int main(int argc, char **argv)
{
  ros::init(argc, argv, "listener");
  boost::thread first (subscribeDestroy); 
  ros::spin();

  return 0;
}

I've started working on solutions to the problem, but no big revelations yet.

rethink-imcmahon pushed a commit to rethink-imcmahon/ros_comm that referenced this issue Aug 25, 2015
This issue manifested with periodic coredumps in the onRetryTimer()
callback in transport_publisher_link.cpp. These crashes were infrequent,
and extremely difficult to reproduce. They appeared to happen when
many Subscriptions were being destroyed, which made it likely to be
a destruction scheduling anomaly.

The underlying issue was a copy of CallbackInfo being created in
callback_queue.cpp. By creating a copy of this object, its shared
pointer reference count for CallbackInterface was incremented,
preventing destruction of the CallbackInterface by an external thread,
even if all of the other Subscription resources had been destroyed.
This led to partial execution of the onRetryTimer callback, until
it attempted to access the destroyed Subscription parent, causing an
invalid memory access, and coredump. This issue only occurs if
the Subscription is destroyed while the callback_queue is blocked by
attempting to lock the boost::shared_lock<boost::shared_mutex> rw_lock()
which would make this a time-dependent thread scheduling issue.

To remedy the invalid callback, a reference to the CallbackInfo
object in Thread Local Storage (tls) is made rather than a copy. This
prevents the CallbackInterface shared pointer reference count from being
incremented (since only the reference is taken). To access the callback
for execution, a weak pointer is made from the CallbackInterface. Then
after the mutex lock is acquired, the weak pointer is used to create a
shared pointer for callback execution. If the weak pointer fails to
return a shared pointer, the CallbackInterface has been destroyed.
In that case, the callback is Invalid and execution is prevented.
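A standalone sketch of the weak-pointer guard described in that commit message (illustrative only; the real change lives in callback_queue.cpp and uses roscpp's own CallbackInterface types):

#include <boost/shared_ptr.hpp>
#include <boost/weak_ptr.hpp>
#include <iostream>

// Stand-in for ros::CallbackInterface.
struct Callback
{
  void call() { std::cout << "callback executed" << std::endl; }
};

// Hold only a weak_ptr up front so nothing here keeps the callback alive.
// After the lock is acquired (omitted in this sketch), promote it; if the
// promotion fails, the callback was destroyed in the meantime and is skipped.
void callOne(const boost::weak_ptr<Callback>& weak_cb)
{
  boost::shared_ptr<Callback> cb = weak_cb.lock();
  if (cb)
  {
    cb->call();
  }
  else
  {
    std::cout << "callback already destroyed, skipping" << std::endl;
  }
}

int main()
{
  boost::shared_ptr<Callback> cb(new Callback());
  boost::weak_ptr<Callback> weak_cb(cb);

  callOne(weak_cb);  // "callback executed"
  cb.reset();        // the owning Subscription goes away
  callOne(weak_cb);  // "callback already destroyed, skipping"
  return 0;
}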
rethink-imcmahon pushed a commit to rethink-imcmahon/ros_comm that referenced this issue Aug 25, 2015
rethink-imcmahon pushed a commit to rethink-imcmahon/ros_comm that referenced this issue Sep 16, 2015
rethink-imcmahon pushed a commit to rethink-imcmahon/ros_comm that referenced this issue Sep 16, 2015
rethink-imcmahon pushed a commit to rethink-imcmahon/ros_comm that referenced this issue Sep 23, 2015
This issue manifested with random coredumps in the onRetryTimer()
callback in transport_publisher_link.cpp. These crashes were infrequent,
and extremely difficult to reproduce. They appeared to happen when
many Subscriptions were being destroyed, which made it likely to be
a destruction scheduling anomaly.

There were multiple potential causes of this issue, and after investigating
several, it was determined that the onRetryTimer function was attempting
to access its parent Subscription instance during CallbackQueue execution.
This could lead to a scenario where invalid memory is accessed if all of
the other Subscription resources had been destroyed in another thread.

In that rare circumstance, the partial execution of the onRetryTimer
callback attempted to access the destroyed Subscription parent_, causing
an invalid memory access, and coredump. This issue only occurs if
the Subscription is destroyed while the callback_queue is blocked by
attempting to lock the boost::shared_lock<boost::shared_mutex> rw_lock()
inside the CallbackQueue::callOneCB function, which makes this a
time-dependent thread scheduling issue.

To solve this issue, some of the logic in onRetryTimer had to be reworked
to avoid accessing the parent Subscription in case it has been destroyed.
Thought was given to mutex locking the parent as it was accessed, but this
resulted in the Subscription class being destroyed while its member mutex
was locked in onRetryTimer inside the CallbackQueue thread, causing a
pthread mutex error. This occurs since mutexes cannot be destroyed if they
are locked. Instead, the need for the parent Subscription access in
onRetryTimer was removed entirely.

The function onRetryTimer() accesses its parent Subscription to attempt
to get the name of the topic that needs to be reconnected in the event
of a disconnect and add that name to a log message. We simply removed
the topic name from the log messages. The second problematic parent
instance access occurred to determine if the Transport type is TCPROS.
If it is anything else (say UDP), the publisher link is dropped.
Checking the type of Transport connection was moved into the main
Subscription thread via the onConnectionDropped() method, so the check
can happen even before the onRetryTimer method is bound to the
CallbackQueue. UDP retry connection dropping should function exactly
the same as before this patch.

The end result of this investigation and patch is less of a likelihood
for invalid pieces of memory to be accessed during normal ROS callback
execution.
rethink-imcmahon pushed a commit to rethink-imcmahon/ros_comm that referenced this issue Oct 6, 2015
rethink-imcmahon pushed a commit to rethink-imcmahon/ros_comm that referenced this issue Oct 7, 2015
This issue manifested with random coredumps in the onRetryTimer()
callback in transport_publisher_link.cpp. These crashes were infrequent,
and extremely difficult to reproduce. They appeared to happen when
many Subscriptions were being destroyed, which made it likely to be
a destruction scheduling anomaly.

It turned out the issue was caused by a few functions that were attached
to boost::signals2::signals. These functions did not have shared_ptr's to
the classes that they were bound to, causing a coredump if the original
classes were destructed. This was particularly problematic if the class
was destroyed while the signal function was in the middle of execution.
Due to a mutex locking strategy in CallbackQueue::CallOneCB() causing semi-
random thread execution timing, this issue was only seen on occasion.

The solution to this bug was to rework how DropSignals were implemented.
Additional function parameters for tracked_objects were added for
dropping connections in order to allow for an object's shared_from_this
shared pointer to be passed into the function that was boost::bind'ed to
a signal2::signal. This prevented the destruction of a Subscriber from
destroying the objects which these Drop functions were acting upon.

Reproduction of this issue was incredibly difficult, since it required
the timing of the class destruction to perfectly coincide with the
execution of the callback function using that class. We found this could
only be reliably reproduced when running overnight simulations of hundreds
of Subscribers by the Rethink Robotics Automated QA team. The occurrence of
this bug was producing roughly 1 to 5 coredumps per night in this testing
environment. Since this patch was applied, zero coredumps relating to this
issue have been discovered over the course of a week of testing.

Major credit to Jeff Adams (@jeffadamsc) for pursuing the shared_from_this
callback connection strategy, and for QA allowing us to run these tests on
their nightly runs.
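A minimal sketch of the shared_from_this keep-alive idea described in that commit message (class and method names are made up; this is not the actual roscpp change):

#include <boost/bind.hpp>
#include <boost/function.hpp>
#include <boost/shared_ptr.hpp>
#include <boost/enable_shared_from_this.hpp>
#include <iostream>

class Link : public boost::enable_shared_from_this<Link>
{
public:
  ~Link() { std::cout << "Link destroyed" << std::endl; }
  void onDropped() { std::cout << "onDropped on a live Link" << std::endl; }
};

int main()
{
  boost::function<void()> drop_cb;
  {
    boost::shared_ptr<Link> link(new Link());
    // Binding shared_from_this() (instead of a raw link.get()) makes the bound
    // function object co-own the Link, so invoking it later cannot touch a
    // destroyed instance -- at the cost of delaying the Link's destruction.
    drop_cb = boost::bind(&Link::onDropped, link->shared_from_this());
  }
  drop_cb();  // still safe: the Link is kept alive by the bound shared_ptr
  return 0;   // the Link is finally destroyed when drop_cb goes out of scope
}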
rethink-imcmahon pushed a commit to rethink-imcmahon/ros_comm that referenced this issue Oct 12, 2015
This issue manifested with random coredumps in the onRetryTimer()
callback in transport_publisher_link.cpp. These crashes were infrequent,
and extremely difficult to reproduce. They appeared to happen when
many Subscriptions were being destroyed, which made it likely to be
a destruction scheduling anomaly.

It turned out there were two issues at play: the onRetryTimer function did
not have a valid reference to the TransportPublisherLink class, and
the onConnectionDropped function was not registered to the
boost::signals2::signal in such a way that tracked the existence of
the underlying ROS Subscriber connection. The first issue (which was
uncovered in ros#577) was resolved by calling getInternalTimerManager()::add
with shared_from_this instead of an empty shared pointer. Testing this
fix then revealed the second issue. Initially, the onConnectionDropped
seemed to be fixed by keeping track of the TransportPublisherLink
class with a shared_from_this pointer, though it did not pass all of
the ros_comm tests. However, changing anything this deep in ROS runs the
risk of unintended consequences, and upon further testing, we discovered
that ros_controls/realtime_tools::RealTimePublishers were not being
properly notified when connections were dropped, and entering deadlock
conditions.

Taking inspiration from the boost::signals2 Automatic Connection Management
tutorial [1], we then used signals2::slot_type and track() mechanisms in order
to maintain a reference to the TransportPublisherLink class during execution
of the onConnectionDropped() code. This was functionally similar to the
previous approach, with the added benefit that boost::signals2 handles
all shared_from_this() reference counting under the hood, without preventing
destruction of the parent class (which we believe to be the source of the
unintended RealTimePublisher deadlocking). Upon the tracked object's
destruction, the signal2 library automatically disconnects the signal.

Reproduction of this issue was incredibly difficult, since it required
the timing of the class destruction to perfectly coincide with the
execution of the callback function using that class. We found this could
only be reliably reproduced when running overnight simulations of hundreds
of Subscribers by the Rethink Robotics Automated QA team. The occurrence of
this bug was producing roughly 1 to 5 coredumps per night in this testing
environment. Since this patch was applied, zero coredumps and
RealTimePublisher deadlocks have been seen over the course of several
days of testing.

Major credit to Jeff Adams (@jeffadamsc) for pursuing the shared_from_this
callback connection strategy, and for QA allowing us to run these tests on
their nightly runs.

[1] http://www.boost.org/doc/libs/1_54_0/doc/html/signals2/tutorial.html#signals2.tutorial.connection-management
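For reference, a minimal sketch of the boost::signals2 track() pattern referred to above (illustrative names, not the actual roscpp types):

#include <boost/signals2.hpp>
#include <boost/shared_ptr.hpp>
#include <iostream>

// Stand-in for TransportPublisherLink with its drop handler.
struct Link
{
  void onConnectionDropped() { std::cout << "handling drop on a live Link" << std::endl; }
};

typedef boost::signals2::signal<void ()> DropSignal;

int main()
{
  DropSignal drop_signal;
  {
    boost::shared_ptr<Link> link(new Link());
    // track(link): signals2 keeps a weak_ptr to the Link and automatically
    // disconnects this slot once the Link is destroyed, so the slot can never
    // run against freed memory and nothing prevents the Link's destruction.
    drop_signal.connect(
        DropSignal::slot_type(&Link::onConnectionDropped, link.get()).track(link));
    drop_signal();  // Link alive: slot fires
  }                 // Link destroyed here: slot is auto-disconnected
  drop_signal();    // no output, no crash
  return 0;
}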
@rethink-imcmahon

@dirk-thomas I think it's safe to close this issue now that #677 is merged. Thanks!

@stwirth
Author

stwirth commented Oct 15, 2015

Awesome @rethink-imcmahon! Thanks for all the effort!

@jasonimercer
Contributor

Is there an ETA on when this fix will transition to the public debs? We're having lots of problems with this on the VMs here at ClearPath.

@dirk-thomas
Member

The changes in shadow-fixed were announced six days ago (http://lists.ros.org/pipermail/ros-release/2015-October/004708.html) and we are waiting for people to test them thoroughly during that soak period. They will probably be synced to public within a week from now.

Karsten1987 pushed a commit to Karsten1987/ros_comm that referenced this issue Nov 6, 2015
@jasonimercer
Contributor

We've put the shadow build of ros_comm 1.11.15-0 on our internal apt servers here at ClearPath and are running it without incident.

@dirk-thomas
Member

Thanks for reporting back.

We will sync it once the shadow-fixed repo has no regressions. Currently 1.11.16 is being built for a different reason.

IanTheEngineer added a commit to RethinkRobotics-opensource/ros_comm that referenced this issue May 25, 2016
rsinnet pushed a commit to MisoRobotics/ros_comm that referenced this issue Jun 19, 2017