
roscpp: crash in ros::TransportPublisherLink::onRetryTimer() #577

Closed
stwirth opened this issue Mar 10, 2015 · 25 comments

@stwirth

stwirth commented Mar 10, 2015

We witnessed a crash of diagnostic_aggregator/aggregator_node on one of our robots; this is what I could find:

Reading symbols from /opt/ros/indigo/lib/diagnostic_aggregator/aggregator_node...(no debugging symbols found)...done.
[New LWP 2295]
[New LWP 2289]
[New LWP 6780]
[New LWP 2284]
[New LWP 2288]
[New LWP 2290]
[Thread debugging using libthread_db enabled]
Using host libthread_db library "/lib/x86_64-linux-gnu/libthread_db.so.1".
Core was generated by `/opt/ros/indigo/lib/diagnostic_aggregator/aggregator_node __name:=diagnostic_ag'.
Program terminated with signal SIGSEGV, Segmentation fault.
#0  0x00007f59196794ef in ros::TransportPublisherLink::onRetryTimer(ros::WallTimerEvent const&) () from /opt/ros/indigo/lib/libroscpp.so
(gdb) bt
#0  0x00007f59196794ef in ros::TransportPublisherLink::onRetryTimer(ros::WallTimerEvent const&) () from /opt/ros/indigo/lib/libroscpp.so
#1  0x00007f591962cb00 in ros::TimerManager<ros::WallTime, ros::WallDuration, ros::WallTimerEvent>::TimerQueueCallback::call() ()
   from /opt/ros/indigo/lib/libroscpp.so
#2  0x00007f59196554f7 in ros::CallbackQueue::callOneCB(ros::CallbackQueue::TLS*) () from /opt/ros/indigo/lib/libroscpp.so
#3  0x00007f5919656303 in ros::CallbackQueue::callAvailable(ros::WallDuration) () from /opt/ros/indigo/lib/libroscpp.so
#4  0x00007f59196825c4 in ros::internalCallbackQueueThreadFunc() () from /opt/ros/indigo/lib/libroscpp.so
#5  0x00007f5917bcba4a in ?? () from /usr/lib/x86_64-linux-gnu/libboost_thread.so.1.54.0
#6  0x00007f59179aa182 in start_thread (arg=0x7f590bfff700) at pthread_create.c:312
#7  0x00007f59184f947d in clone () at ../sysdeps/unix/sysv/linux/x86_64/clone.S:111

Unfortunately that happened only once on one robot. No way to reproduce it. Has anybody seen something like that or can give me a hint on how to debug this further?

@rethink-imcmahon

+1 on this issue. I'm investigating a core with a very similar stack trace. It's incredibly difficult to reproduce, but I've managed to get two instances of the dump now:

Program terminated with signal 11, Segmentation fault.
(gdb) bt full
#0  atomic_conditional_increment (pw=0x3241) at /usr/include/boost/smart_ptr/detail/sp_counted_base_gcc_x86.hpp:92
        rv = <optimized out>
#1  add_ref_lock (this=0x3239) at /usr/include/boost/smart_ptr/detail/sp_counted_base_gcc_x86.hpp:139
No locals.
#2  shared_count (r=..., this=0x7f37ab7fdb18) at /usr/include/boost/smart_ptr/detail/shared_count.hpp:589
No locals.
#3  boost::shared_ptr<ros::Subscription>::shared_ptr<ros::Subscription> (this=0x7f37ab7fdb10, r=...) at /usr/include/boost/smart_ptr/shared_ptr.hpp:414
No locals.
#4  0x00007f37b900b7b0 in lock (this=0x7f376c0275b8) at /usr/include/boost/smart_ptr/weak_ptr.hpp:178
No locals.
#5  ros::TransportPublisherLink::onRetryTimer (this=0x7f376c0275a0) at /var/tmp/portage/sci-ros/ros-indigo-roscpp-1.11.8.0-r2/work/ros-indigo-roscpp-1.11.8.0/src/libros/transport_publisher_link.cpp:204
        parent = {
          px = 0x0, 
          pn = {
            pi_ = 0x3239
          }
        }
        __PRETTY_FUNCTION__ = "void ros::TransportPublisherLink::onRetryTimer(const ros::WallTimerEvent&)"
#6  0x00007f37b8fc1e56 in operator() (a0=..., this=<optimized out>) at /usr/include/boost/function/function_template.hpp:767
No locals.
#7  ros::TimerManager<ros::WallTime, ros::WallDuration, ros::WallTimerEvent>::TimerQueueCallback::call (this=0x7f371c000eb0)
    at /var/tmp/portage/sci-ros/ros-indigo-roscpp-1.11.8.0-r2/work/ros-indigo-roscpp-1.11.8.0/include/ros/timer_manager.h:184
        tracked = {
          px = 0x0, 
          pn = {
            pi_ = 0x0
          }
        }
        event = {
          last_expected = {
            <ros::TimeBase<ros::WallTime, ros::WallDuration>> = {
              sec = 1430265753, 
              nsec = 727748844
            }, <No data fields>}, 
          last_real = {
            <ros::TimeBase<ros::WallTime, ros::WallDuration>> = {
              sec = 1430265753, 
              nsec = 727984536
            }, <No data fields>}, 
          current_expected = {
            <ros::TimeBase<ros::WallTime, ros::WallDuration>> = {
              sec = 1430265753, 
              nsec = 827748844
            }, <No data fields>}, 
          current_real = {
            <ros::TimeBase<ros::WallTime, ros::WallDuration>> = {
              sec = 1430265753, 
              nsec = 828038627
            }, <No data fields>}, 
          profile = {
            last_duration = {
              <ros::DurationBase<ros::WallDuration>> = {
                sec = 0, 
                nsec = 424
              }, <No data fields>}
          }
        }
#8  0x00007f37b8fea4ef in ros::CallbackQueue::callOneCB (this=this@entry=0x24eb900, tls=tls@entry=0x7f37a40008c0)
    at /var/tmp/portage/sci-ros/ros-indigo-roscpp-1.11.8.0-r2/work/ros-indigo-roscpp-1.11.8.0/src/libros/callback_queue.cpp:382
        rw_lock = {
          m = 0x7f371c0014f8, 
          is_locked = true
        }
        last_calling = 18446744073709551615
        result = ros::CallbackInterface::Invalid
        info = {
          callback = {
            px = 0x7f371c000eb0, 
            pn = {
              pi_ = 0x7f371c000910
            }
          }, 
          removal_id = 139877086151072, 
          marked_for_removal = false
        }
        cb = @0x7f37ab7fdd00: {
          px = 0x7f371c000eb0, 
          pn = {
            pi_ = 0x7f371c000910
          }
        }
        id_info = {
          px = 0x7f371c0014f0, 
          pn = {
            pi_ = 0x7f371c000a80
          }
        }
#9  0x00007f37b8fea843 in ros::CallbackQueue::callAvailable (this=0x24eb900, timeout=...)
    at /var/tmp/portage/sci-ros/ros-indigo-roscpp-1.11.8.0-r2/work/ros-indigo-roscpp-1.11.8.0/src/libros/callback_queue.cpp:333
        tls = <optimized out>
        called = 0
#10 0x00007f37b90162cc in ros::internalCallbackQueueThreadFunc () at /var/tmp/portage/sci-ros/ros-indigo-roscpp-1.11.8.0-r2/work/ros-indigo-roscpp-1.11.8.0/src/libros/init.cpp:282
        queue = {
          px = 0x24eb900, 
          pn = {
            pi_ = 0x24e7cd0
          }
        }
#11 0x00007f37bb4b594d in boost::(anonymous namespace)::thread_proxy (param=<optimized out>) at libs/thread/src/pthread/thread.cpp:164
        thread_info = {
          px = 0x24eeac0, 
          pn = {
            pi_ = 0x24eb860
          }
        }
#12 0x00007f37b8ce41da in start_thread () from /lib64/libpthread.so.0
No symbol table info available.
#13 0x00007f37b6e11d7d in clone () from /lib64/libc.so.6
No symbol table info available.

The only useful piece of information that I can provide in reproducing the dump is that it happens (sometimes) as I kill a node that published / subscribed to the same topics that the dumping node subscribed / published to respectively.

Poking around in frame 8 (died at line 382 of /src/libros/callback_queue.cpp), I see that the callback in the TLS is not marked_for_removal, which causes cb->call() to be executed on a subscription that has already been destroyed:

  362       CallbackInfo info = *tls->cb_it;
   ...
  375           if (info.marked_for_removal)
  376           {
  377             tls->cb_it = tls->callbacks.erase(tls->cb_it);
  378           }
  379           else
  380           {
  381             tls->cb_it = tls->callbacks.erase(tls->cb_it);
> 382             result = cb->call();
  383           }

Should this callback be marked for removal, and if so, how?

(gdb) print info
$50 = {
  callback = {
    px = 0x7f371c000eb0, 
    pn = {
      pi_ = 0x7f371c000910
    }
  }, 
  removal_id = 139877086151072, 
  marked_for_removal = false
}

@rethink-imcmahon

@dirk-thomas & @wjwwood, have you seen any instances of this segfault at OSRF? I now have several instances from various ROS nodes, mostly on disconnect, but the core dumps are frustratingly hard to debug past a certain point (all with traces similar to the one shown above).

I'm trying to distill this crash down into a simple, reproducible example, so any thoughts or tips would be greatly appreciated.

@wjwwood
Member

wjwwood commented May 6, 2015

@rethink-imcmahon I have not. I'll try to spend some time today or tomorrow looking at it.

@rethink-imcmahon

OK, I still don't have a nice reproducible example, but I think I have uncovered the issue.
The dumps always occur in this block of code: https://github.com/ros/ros_comm/blob/indigo-devel/clients/roscpp/src/libros/callback_queue.cpp#L362-L383

Inside CallbackQueue::callOneCB() there is a very important info variable:

362 | CallbackInfo info = *tls->cb_it;
...
368 |    boost::shared_lock<boost::shared_mutex> rw_lock(id_info->calling_rw_mutex);
...
375 |     if (info.marked_for_removal)
376 |     {
377 |        tls->cb_it = tls->callbacks.erase(tls->cb_it);
378 |     }
379 |     else
380 |     {
381 |        tls->cb_it = tls->callbacks.erase(tls->cb_it);
382 |        result = cb->call();
383 |     }

What seems odd here is that we are not referencing the original *tls->cb_it object, but making a copy of it! To make the reference properly, we need to add an &:

362 | CallbackInfo& info = *tls->cb_it;

Furthermore, if you look at another very similar block of code in the same file, you can see the reference is taken properly for a CallbackInfo object:
https://github.com/ros/ros_comm/blob/indigo-devel/clients/roscpp/src/libros/callback_queue.cpp#L248-L261

248 |      CallbackInfo& info = *it;
249 |
250 |      if (info.marked_for_removal)
251 |      {
252 |        it = callbacks_.erase(it);
253 |        continue;
254 |      }

Copy vs reference? It really shouldn't matter, should it? Well, in the L362-383 block it does, since there is a shared_mutex lock in between the copy and the if (info.marked_for_removal). If the conditions are just right, by the time this thread acquires the lock the info copy is stale: its marked_for_removal is still false, even though the original object's marked_for_removal field has since been set to true. That causes result = cb->call() to execute on an invalid callback and coredump.
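To make the staleness concrete, here is a minimal standalone illustration (not roscpp code; the time spent blocked on calling_rw_mutex is modelled by a plain assignment in between):

#include <iostream>

// Stand-in for the real CallbackInfo; only the flag matters for this sketch.
struct CallbackInfo { bool marked_for_removal; };

int main()
{
  CallbackInfo original = { false };

  CallbackInfo copy = original;   // what line 362 currently does
  CallbackInfo& ref = original;   // what the proposed patch does

  // While the calling thread would be blocked acquiring calling_rw_mutex,
  // another thread (e.g. removeByID) marks the callback for removal:
  original.marked_for_removal = true;

  std::cout << "copy sees marked_for_removal = " << copy.marked_for_removal << std::endl;  // 0: stale, callback would run
  std::cout << "ref sees marked_for_removal = " << ref.marked_for_removal << std::endl;    // 1: callback would be skipped
  return 0;
}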

I'm wide open for critiques of this theory & ideas on a reliable method for reproducing this crash...

@rethink-imcmahon

ros_comm built fine with the changes that I proposed. Perhaps I'm just lucky, but I managed to get two coredumps in a row with modified talker / subscriber roscpp_tutorial nodes. Trying to reproduce the issue (and gain more insight into the original *tls->cb_it object), I added a couple of copies to bring them into the local scope, then rebuilt. Now the nodes are spinning up and down continuously every few seconds, so I'll post if I get another useful coredump.

@dirk-thomas
Member

I have created PR #615 to add the missing reference. Even if it is unclear whether it helps solve this problem, not copying the callback info at that point is the "right" thing to do. (Update: as trivial as the patch looks, it doesn't seem to pass all unit tests.)

Can you please share some information about the scenario in which you get the crashes with the roscpp_tutorial talker / listener? Then others might be able to reproduce it the same way.

@dirk-thomas
Member

I want to summarize my thoughts from reading through the code - maybe it helps others trace the flow faster.

The callback iterators in the thread-local storage (TLS) are the ones that have already been popped off the queue to be called within a specific thread, without requiring locks on the queue itself.

There are four lines which use marked_for_removal besides the constructor of the info object:

The second one checks the flag for callbacks in the queue and therefore does so with a locked mutex_.

The third and fourth check the flag for a callback in the TLS and therefore do so while holding a lock on the calling_rw_mutex. The invocations of callOneCB happen from callAvailable, which picks callbacks from the queue and puts them into the thread-local storage (while holding mutex_). The invocation of each callback in the TLS is then done mutex-free (which is the idea of the TLS).

The first one actually marks all pending callbacks in the TLS for removal. The surrounding removeByID method holds no locks while setting these flags. This method is invoked from several places whenever callbacks are being removed / deregistered. This could happen asynchronously from a different thread as well as within another callback.

But since the TLS contains only thread-local callbacks, the block changing the flags

{
  D_CallbackInfo::iterator it = tls_->callbacks.begin();
  D_CallbackInfo::iterator end = tls_->callbacks.end();
  for (; it != end; ++it)
  {
    CallbackInfo& info = *it;
    if (info.removal_id == removal_id)
    {
      info.marked_for_removal = true;
    }
  }
}

can only occur in case of an invocation from within another callback. Otherwise the TLS should not contain anything. Therefore I think it is perfectly safe to do this without additional locking.
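As a rough sketch of that queue-vs-TLS split (illustrative only, not roscpp code; a per-call local list stands in for the real thread-local storage):

#include <boost/thread/mutex.hpp>
#include <boost/function.hpp>
#include <iostream>
#include <list>

// Shared queue protected by mutex_ (names are made up for this sketch).
boost::mutex mutex_;
std::list<boost::function<void()> > callbacks_;

void someCallback() { std::cout << "invoked without the queue lock held" << std::endl; }

void callAvailable()
{
  // Move the pending callbacks into a list that only this thread touches...
  std::list<boost::function<void()> > tls_callbacks;
  {
    boost::mutex::scoped_lock lock(mutex_);
    tls_callbacks.swap(callbacks_);
  }
  // ...so they can be invoked without holding any lock on the queue.
  for (std::list<boost::function<void()> >::iterator it = tls_callbacks.begin();
       it != tls_callbacks.end(); ++it)
  {
    (*it)();
  }
}

int main()
{
  {
    boost::mutex::scoped_lock lock(mutex_);
    callbacks_.push_back(&someCallback);
  }
  callAvailable();
  return 0;
}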

@rethink-imcmahon

Ok, thank you for the clarifications Dirk. I'm still trying to wrap my head around some of it, but the explanation is certainly helpful. Just brainstorming ways to debug this, I was thinking of adding an artificial wait around the lock to simulate a starved thread. I'll let you know if this line of thought turns up anything.

As for the talker / listener, I'll commit them to my fork and put a link here. They did segfault overnight but in another piece of initialization code. I'll look into that later, but for now this is still my focus.

@rethink-imcmahon

What are the unit tests checking for exactly?

@dirk-thomas
Member

Are you referring to the unit tests failing for #615? I don't know what they are checking exactly. I will have to look more closely into them, but it seems they are failing due to the added reference.

@rethink-imcmahon

@dirk-thomas or @stwirth, do you have any updates on this bug? I'd be curious to know why the #615 unit tests were failing, and how we can fix up this patch.

@stwirth
Author

stwirth commented Jun 24, 2015

@rethink-imcmahon I don't have updates. The bug never appeared again, so I haven't spent time on it.

@dirk-thomas
Member

Also no updates from me. I haven't spent any more time on this since the last comment. I don't expect to have time to look further into it, and therefore I tried to summarize my investigation as well as possible to enable others to continue looking into it.

@jasonimercer
Contributor

I've observed this crash but I don't have anything meaningful to add:

Core was generated by `/home/administrator/robot_ws/devel/lib/move_base/move_base map:=slam/map cmd_ve'.
Program terminated with signal SIGSEGV, Segmentation fault.
#0  0x00007ff4450347a3 in ros::TransportPublisherLink::onRetryTimer(ros::WallTimerEvent const&) () from /opt/ros/indigo/lib/libroscpp.so
(gdb) bt
#0  0x00007ff4450347a3 in ros::TransportPublisherLink::onRetryTimer(ros::WallTimerEvent const&) () from /opt/ros/indigo/lib/libroscpp.so
#1  0x00007ff444fe7b30 in ros::TimerManager<ros::WallTime, ros::WallDuration, ros::WallTimerEvent>::TimerQueueCallback::call() () from /opt/ros/indigo/lib/libroscpp.so
#2  0x00007ff445010777 in ros::CallbackQueue::callOneCB(ros::CallbackQueue::TLS*) () from /opt/ros/indigo/lib/libroscpp.so
#3  0x00007ff445011583 in ros::CallbackQueue::callAvailable(ros::WallDuration) () from /opt/ros/indigo/lib/libroscpp.so
#4  0x00007ff44503d8f4 in ros::internalCallbackQueueThreadFunc() () from /opt/ros/indigo/lib/libroscpp.so
#5  0x00007ff442acfa4a in ?? () from /usr/lib/x86_64-linux-gnu/libboost_thread.so.1.54.0
#6  0x00007ff4428ae182 in start_thread (arg=0x7ff43b37e700) at pthread_create.c:312
#7  0x00007ff44433a47d in clone () at ../sysdeps/unix/sysv/linux/x86_64/clone.S:111

@rethink-imcmahon

We have many virtual machines running a wide variety of ROS tasks nightly, so this issue is causing a handful of coredumps every night. In trying to see why the fix I proposed in May didn't work, I added the "reference-instead-of-copy" patch locally. Eventually, I managed to make it core dump:

(gdb) bt full
#0  0x00007f0e3c5afebc in ros::CallbackQueue::callOneCB (this=0x2129460, tls=0x7f0e280008c0)
    at /data/users/imcmahon/roscpp_ws/src/ros-indigo-roscpp-1.11.8/src/libros/callback_queue.cpp:382
        rw_lock = {m = 0x7f0ddc000938, is_locked = true}
        last_calling = 18446744073709551615
        result = ros::CallbackInterface::Invalid
        __PRETTY_FUNCTION__ = "ros::CallbackQueue::CallOneResult ros::CallbackQueue::callOneCB(ros::CallbackQueue::TLS*)"
        info = @0x7f0e28001910: {callback = {px = 0x7f0ddc0008c0, pn = {pi_ = 0x7f0ddc000910}}, 
          removal_id = 139698911842016, marked_for_removal = false}
        cb = @0x7f0e28001910: {px = 0x7f0ddc0008c0, pn = {pi_ = 0x7f0ddc000910}}
        id_info = {px = 0x7f0ddc000930, pn = {pi_ = 0x7f0ddc000a80}}
#1  0x00007f0e3c5af921 in ros::CallbackQueue::callAvailable (this=0x2129460, timeout=...)
    at /data/users/imcmahon/roscpp_ws/src/ros-indigo-roscpp-1.11.8/src/libros/callback_queue.cpp:333
        tls = 0x7f0e280008c0
        called = 0
#2  0x00007f0e3c5f5160 in ros::internalCallbackQueueThreadFunc ()
    at /data/users/imcmahon/roscpp_ws/src/ros-indigo-roscpp-1.11.8/src/libros/init.cpp:282
        queue = {px = 0x2129460, pn = {pi_ = 0x21289c0}}
#3  0x00007f0e3c6006c9 in boost::detail::thread_data<void (*)()>::run (this=0x2129cb0)
    at /usr/include/boost/thread/detail/thread.hpp:117
No locals.
#4  0x00007f0e3ab79a4a in ?? () from /usr/lib/x86_64-linux-gnu/libboost_thread.so.1.54.0
No symbol table info available.
#5  0x00007f0e3a958182 in start_thread (arg=0x7f0e34b86700) at pthread_create.c:312
        __res = <optimized out>
        pd = 0x7f0e34b86700
        now = <optimized out>
        unwind_buf = {cancel_jmp_buf = {{jmp_buf = {139698990769920, -5890305797539832903, 0, 0, 139698990770624, 
                139698990769920, 5790792828027591609, 5790814652555123641}, mask_was_saved = 0}}, priv = {pad = {0x0, 
              0x0, 0x0, 0x0}, data = {prev = 0x0, cleanup = 0x0, canceltype = 0}}}
        not_first_call = <optimized out>
        pagesize_m1 = <optimized out>
        sp = <optimized out>
        freesize = <optimized out>
        __PRETTY_FUNCTION__ = "start_thread"
#6  0x00007f0e3b6f347d in clone () at ../sysdeps/unix/sysv/linux/x86_64/clone.S:111
(gdb) frame 0
#0  0x00007f0e3c5afebc in ros::CallbackQueue::callOneCB (this=0x2129460, tls=0x7f0e280008c0)
    at /data/users/imcmahon/roscpp_ws/src/ros-indigo-roscpp-1.11.8/src/libros/callback_queue.cpp:382
382         result = cb->call();
(gdb) print cb
$1 = (ros::CallbackInterfacePtr &) @0x7f0e28001910: {px = 0x7f0ddc0008c0, pn = {pi_ = 0x7f0ddc000910}}

This coredump was quite hard to produce, but after hundreds of subscribers were spun up and SIGINT'd, I managed to get what you see here. So it seems that before, when we were copying CallbackInfo rather than grabbing a reference to it, we were seeing partial execution of the callback (onRetryTimer). Now, it appears the underlying result = cb->call() itself is no longer valid. I think this has to do with the pointer to the subscriber being destroyed, causing the dereference to fail.
Before this patch, when callback_queue held a copy of CallbackInfo info, we could last a little longer and end up in more dangerous territory by trying to utilize a destroyed weak_ptr. According to the documentation, weak_ptr operations never throw exceptions, so roscpp must have been doing something wrong to make the weak_ptr blow up down in the assembly code, as we see in the original stack trace. I looked at some plausible causes for this: threads being destroyed while references to their callbacks still exist, or possibly this - the Subscription's vector of publisher links being swapped out during connection dropping, leaving an empty vector in its place:

void Subscription::dropAllConnections()
{
  // Swap our subscribers list with a local one so we can only lock for a short period of time, because a
  // side effect of our calling drop() on connections can be re-locking the subscribers mutex
  V_PublisherLink localsubscribers;
  {
    boost::mutex::scoped_lock lock(publisher_links_mutex_);
    localsubscribers.swap(publisher_links_);
  }

  V_PublisherLink::iterator it = localsubscribers.begin();
  V_PublisherLink::iterator end = localsubscribers.end();
  for (; it != end; ++it)
  {
    (*it)->drop();
  }
}

Despite the new core dump, I think I still prefer having the reference for CallbackInfo rather than the direct copy, since having the copy led us into a weird assembly code state, and the reference gives us a null cb->call() faster (which can be handled appropriately). I do feel like I am dancing around the underlying issue, so if anyone can give a more satisfying explanation of these errors, I'd be happy to hear it.

@rethink-imcmahon

The head of our QA department sent me some stats on this bug today. We run an extensive suite of tests on a nightly basis with many virtual robots in the cloud. In just the virtual robots (not counting the instances of this bug that we find on our real hardware), he found the following over the past month:

Total count of core dumps since 7/19/15: 56
Total count of test runs: 59
Total count of test suites per run: 40
On average, each suite takes 36 min.

So we saw 56 instances of the roscpp bug across a total of 2360 newly cleaned robots, each of them running heavily for 36 min.

It seems every ROS subscriber is susceptible to the crash, and the nodes that create and destroy subscriptions regularly are the worst offenders. I am still actively hunting this bug down, but I would welcome any help from others in the community. @wjwwood @dirk-thomas @jacquelinekay any thoughts on how to raise awareness of this issue in the community?

@wjwwood
Member

wjwwood commented Aug 20, 2015

@rethink-imcmahon I'd like to have a crack at it, but realistically I'm going to be too busy for a few weeks. Maybe you can just send an email to ros-users@ explaining that several people have taken some time to look at the problem but it eludes us. Someone might take it as a challenge and find a problem that we have overlooked.

@jacquelinekay

@rethink-imcmahon here are a few ideas for creating a test to reliably reproduce this crash (since it sounds like it happens randomly):

The fact that it's difficult to reproduce but happens more frequently when there are many subscribers makes me wonder if it happens when there's a scheduling anomaly. Can you try adding a sleep right before line 382 of callback_queue.cpp?

Or, to confirm or deny your theory about the subscriptions dropping, you could make a test where one thread spins doing some pub/sub work, and another thread causes all subscriptions to drop.

@jeffadamsc

I made a little progress on this today. @jacquelinekay, thanks for the tip. I added the line "usleep(100);" before line 382 of callback_queue.cpp and that did help to reproduce the bug more often. I also modified a simple test by @rethink-imcmahon as follows and I can reproduce the bug about every minute:


#include "ros/ros.h"
#include "std_msgs/String.h"
#include <sstream>
#include <boost/thread.hpp>


void chatterCallback(const std_msgs::String::ConstPtr& msg)
{
  ROS_INFO("I heard: [%s]", msg->data.c_str());
}

void subscribeDestroy() {
  ros::Time start_time;
  double rollover_time = 5.0; // seconds
  size_t num_subs = 100;
  std_msgs::String msg;
  msg.data = std::string("hello world");
  ros::NodeHandle n;
  while(ros::ok())
  {
    start_time = ros::Time::now();
    std::vector<ros::Subscriber> subs;
    for( size_t i = 0; i < num_subs; i++ )
    {
      subs.push_back(n.subscribe("chatter", 100, chatterCallback));
    }
    usleep(500000);
    // subscribers destructors called here
    std::cout<<"Rolling over! Calling dtors now. "<<"Total time: "<<(ros::Time::now() - start_time).toSec()<<std::endl;
  }
}

int main(int argc, char **argv)
{
  ros::init(argc, argv, "listener");
  boost::thread first (subscribeDestroy); 
  ros::spin();

  return 0;
}

I've started working on solutions to the problem, but no big revelations yet.

rethink-imcmahon pushed a commit to rethink-imcmahon/ros_comm that referenced this issue Aug 25, 2015
This issue manifested with periodic coredumps in the onRetryTimer()
callback in transport_publisher_link.cpp. These crashes were infrequent,
and extremely difficult to reproduce. They appeared to happen when
many Subscriptions were being destroyed, which made it likely to be
a destruction scheduling anomaly.

The underlying issue was a copy of CallbackInfo being created in
callback_queue.cpp. By creating a copy of this object, its shared
pointer reference count for CallbackInterface was incremented,
preventing destruction of the CallbackInterface by an external thread,
even if all of the other Subscription resources had been destroyed.
This led to partial execution of the onRetryTimer callback, until
it attempted to access the destroyed Subscription parent, causing an
invalid memory access, and coredump. This issue only occurs if
the Subscription is destroyed while the callback_queue is blocked by
attempting to lock the boost::shared_lock<boost::shared_mutex> rw_lock()
which would make this a time-dependent thread scheduling issue.

To remedy the invalid callback, a reference to the CallbackInfo
object in Thread Local Storage (tls) is made rather than a copy. This
prevents the CallbackInterface shared pointer reference count from being
incremented (since only the reference is taken). To access the callback
for execution, a weak pointer is made from the CallbackInterface. Then
after the mutex lock is acquired, the weak pointer is used to create a
shared pointer for callback execution. If the weak pointer fails to
return a shared pointer, the CallbackInterface has been destroyed.
In that case, the callback is Invalid and execution is prevented.
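A standalone sketch of the weak-pointer guard described in that commit message (illustrative only; the real change lives in callback_queue.cpp and uses roscpp's own CallbackInterface types):

#include <boost/shared_ptr.hpp>
#include <boost/weak_ptr.hpp>
#include <iostream>

// Stand-in for ros::CallbackInterface.
struct Callback
{
  void call() { std::cout << "callback executed" << std::endl; }
};

// Hold only a weak_ptr up front so nothing here keeps the callback alive.
// After the lock is acquired (omitted in this sketch), promote it; if the
// promotion fails, the callback was destroyed in the meantime and is skipped.
void callOne(const boost::weak_ptr<Callback>& weak_cb)
{
  boost::shared_ptr<Callback> cb = weak_cb.lock();
  if (cb)
  {
    cb->call();
  }
  else
  {
    std::cout << "callback already destroyed, skipping" << std::endl;
  }
}

int main()
{
  boost::shared_ptr<Callback> cb(new Callback());
  boost::weak_ptr<Callback> weak_cb(cb);

  callOne(weak_cb);  // "callback executed"
  cb.reset();        // the owning Subscription goes away
  callOne(weak_cb);  // "callback already destroyed, skipping"
  return 0;
}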
rethink-imcmahon pushed a commit to rethink-imcmahon/ros_comm that referenced this issue Aug 25, 2015
rethink-imcmahon pushed a commit to rethink-imcmahon/ros_comm that referenced this issue Sep 16, 2015
rethink-imcmahon pushed a commit to rethink-imcmahon/ros_comm that referenced this issue Sep 16, 2015
rethink-imcmahon pushed a commit to rethink-imcmahon/ros_comm that referenced this issue Sep 23, 2015
This issue manifested with random coredumps in the onRetryTimer()
callback in transport_publisher_link.cpp. These crashes were infrequent,
and extremely difficult to reproduce. They appeared to happen when
many Subscriptions were being destroyed, which made it likely to be
a destruction scheduling anomaly.

There were multiple potential causes of this issue, and after investigating
several, it was determined that the onRetryTimer function was attempting
to access its parent Subscription instance during CallbackQueue execution.
This could lead to a scenario where invalid memory is accessed if all of
the other Subscription resources had been destroyed in another thread.

In that rare circumstance, the partial execution of the onRetryTimer
callback attempted to access the destroyed Subscription parent_, causing
an invalid memory access, and coredump. This issue only occurs if
the Subscription is destroyed while the callback_queue is blocked by
attempting to lock the boost::shared_lock<boost::shared_mutex> rw_lock()
inside the CallbackQueue::callOneCB function, which makes this a
time-dependent thread scheduling issue.

To solve this issue, some of the logic in onRetryTimer had to be reworked
to avoid accessing the parent Subscription in case it has been destroyed.
Thought was given to mutex locking the parent as it was accessed, but this
resulted in the Subscription class being destroyed while its member mutex
was locked in onRetryTimer inside the CallbackQueue thread, causing a
pthread mutex error. This occurs since mutexes cannot be destroyed if they
are locked. Instead, the need for the parent Subscription access in
onRetryTimer was removed entirely.

The function onRetryTimer() accesses its parent Subscription to attempt
to get the name of the topic that needs to be reconnected in the event
of a disconnect and add that name to a log message. We simply removed
the topic name from the log messages. The second problematic parent
instance access occurred to determine if the Transport type is TCPROS.
If it is anything else (say UDP), the publisher link is dropped.
Checking the type of Transport connection was moved into the main
Subscription thread via the onConnectionDropped() method, so the check
can happen even before the onRetryTimer method is bound to the
CallbackQueue. UDP retry connection dropping should function exactly
the same as before this patch.

The end result of this investigation and patch is less of a likelihood
for invalid pieces of memory to be accessed during normal ROS callback
execution.
rethink-imcmahon pushed a commit to rethink-imcmahon/ros_comm that referenced this issue Oct 6, 2015
rethink-imcmahon pushed a commit to rethink-imcmahon/ros_comm that referenced this issue Oct 7, 2015
This issue manifested with random coredumps in the onRetryTimer()
callback in transport_publisher_link.cpp. These crashes were infrequent,
and extremely difficult to reproduce. They appeared to happen when
many Subscriptions were being destroyed, which made it likely to be
a destruction scheduling anomaly.

It turned out the issue was caused by a few functions that were attached
to boost::signals2::signals. These functions did not have shared_ptr's to
the classes that they were bound to, causing a coredump if the original
classes were destructed. This was particularly problematic if the class
was destroyed while the signal function was in the middle of execution.
Due to a mutex locking strategy in CallbackQueue::CallOneCB() causing semi-
random thread execution timing, this issue was only seen on occasion.

The solution to this bug was to rework how DropSignals were implemented.
Additional function parameters for tracked_objects were added for
dropping connections in order to allow for an object's shared_from_this
shared pointer to be passed into the function that was boost::bind'ed to
a signal2::signal. This prevented the destruction of a Subscriber from
destroying the objects which these Drop functions were acting upon.

Reproduction of this issue was incredibly difficult, since it required
the timing of the class destruction to perfectly coincide with the
execution of the callback function using that class. We found this could
only be reliably reproduced when running overnight simulations of hundreds
of Subscribers by the Rethink Robotics Automated QA team. The occurrence of
this bug was producing roughly 1 to 5 coredumps per night in this testing
environment. Since this patch was applied, zero coredumps relating to this
issue have been discovered over the course of a week of testing.

Major credit to Jeff Adams (@jeffadamsc) for pursuing the shared_from_this
callback connection strategy, and for QA allowing us to run these tests on
their nightly runs.
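A minimal sketch of the shared_from_this keep-alive idea described in that commit message (class and method names are made up; this is not the actual roscpp change):

#include <boost/bind.hpp>
#include <boost/function.hpp>
#include <boost/shared_ptr.hpp>
#include <boost/enable_shared_from_this.hpp>
#include <iostream>

class Link : public boost::enable_shared_from_this<Link>
{
public:
  ~Link() { std::cout << "Link destroyed" << std::endl; }
  void onDropped() { std::cout << "onDropped on a live Link" << std::endl; }
};

int main()
{
  boost::function<void()> drop_cb;
  {
    boost::shared_ptr<Link> link(new Link());
    // Binding shared_from_this() (instead of a raw link.get()) makes the bound
    // function object co-own the Link, so invoking it later cannot touch a
    // destroyed instance -- at the cost of delaying the Link's destruction.
    drop_cb = boost::bind(&Link::onDropped, link->shared_from_this());
  }
  drop_cb();  // still safe: the Link is kept alive by the bound shared_ptr
  return 0;   // the Link is finally destroyed when drop_cb goes out of scope
}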
rethink-imcmahon pushed a commit to rethink-imcmahon/ros_comm that referenced this issue Oct 12, 2015
This issue manifested with random coredumps in the onRetryTimer()
callback in transport_publisher_link.cpp. These crashes were infrequent,
and extremely difficult to reproduce. They appeared to happen when
many Subscriptions were being destroyed, which made it likely to be
a destruction scheduling anomaly.

It turned out there were two issues at play: the onRetryTimer function did
not have a valid reference to the TransportPublisherLink class, and
the onConnectionDropped function was not registered to the
boost::signals2::signal in such a way that tracked the existence of
the underlying ROS Subscriber connection. The first issue (which was
uncovered in ros#577) was resolved by calling getInternalTimerManager()::add
with shared_from_this instead of an empty shared pointer. Testing this
fix then revealed the second issue. Initially, the onConnectionDropped
seemed to be fixed by keeping track of the TransportPublisherLink
class with a shared_from_this pointer, though it did not pass all of
the ros_comm tests. However, changing anything this deep in ROS runs the
risk of unintended consequences, and upon further testing, we discovered
that ros_controls/realtime_tools::RealTimePublishers were not being
properly notified when connections were dropped, and entering deadlock
conditions.

Taking inspiration from the boost::signals2 Automatic Connection Management
tutorial [1], we then used signals2::slot_type and track() mechanisms in order
to maintain a reference to the TransportPublisherLink class during execution
of the onConnectionDropped() code. This was functionally similar to the
previous approach, with the added benefit that boost::signals2 handles
all shared_from_this() reference counting under the hood, without preventing
destruction of the parent class (which we believe to be the source of the
unintended RealTimePublisher deadlocking). Upon the tracked object's
destruction, the signal2 library automatically disconnects the signal.

Reproduction of this issue was incredibly difficult, since it required
the timing of the class destruction to perfectly coincide with the
execution of the callback function using that class. We found this could
only be reliably reproduced when running overnight simulations of hundreds
of Subscribers by the Rethink Robotics Automated QA team. The occurrence of
this bug was producing roughly 1 to 5 coredumps per night in this testing
environment. Since this patch was applied, zero coredumps and
RealTimePublisher deadlocks have been seen over the course of several
days of testing.

Major credit to Jeff Adams (@jeffadamsc) for pursuing the shared_from_this
callback connection strategy, and for QA allowing us to run these tests on
their nightly runs.

[1] http://www.boost.org/doc/libs/1_54_0/doc/html/signals2/tutorial.html#signals2.tutorial.connection-management
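For reference, a minimal sketch of the boost::signals2 track() pattern referred to above (illustrative names, not the actual roscpp types):

#include <boost/signals2.hpp>
#include <boost/shared_ptr.hpp>
#include <iostream>

// Stand-in for TransportPublisherLink with its drop handler.
struct Link
{
  void onConnectionDropped() { std::cout << "handling drop on a live Link" << std::endl; }
};

typedef boost::signals2::signal<void ()> DropSignal;

int main()
{
  DropSignal drop_signal;
  {
    boost::shared_ptr<Link> link(new Link());
    // track(link): signals2 keeps a weak_ptr to the Link and automatically
    // disconnects this slot once the Link is destroyed, so the slot can never
    // run against freed memory and nothing prevents the Link's destruction.
    drop_signal.connect(
        DropSignal::slot_type(&Link::onConnectionDropped, link.get()).track(link));
    drop_signal();  // Link alive: slot fires
  }                 // Link destroyed here: slot is auto-disconnected
  drop_signal();    // no output, no crash
  return 0;
}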
@rethink-imcmahon

@dirk-thomas I think it's safe to close this issue now that #677 is merged. Thanks!

@stwirth
Author

stwirth commented Oct 15, 2015

Awesome @rethink-imcmahon! Thanks for all the effort!

@jasonimercer
Contributor

Is there an ETA on when this fix will transition to the public debs? We're having lots of problems with this on the VMs here at ClearPath.

@dirk-thomas
Member

The changes in shadow-fixed were announced six days ago (http://lists.ros.org/pipermail/ros-release/2015-October/004708.html) and we are waiting for people to test them thoroughly during that soak period. They will probably be synced to public within a week from now.

Karsten1987 pushed a commit to Karsten1987/ros_comm that referenced this issue Nov 6, 2015
@jasonimercer
Contributor

We've put the shadow build of ros_comm 1.11.15-0 on our internal apt servers here at ClearPath and are running it without incident.

@dirk-thomas
Member

Thanks for reporting back.

We will sync it once the shadow-fixed repo has no regressions. Currently 1.11.16 is being built for a different reason.

IanTheEngineer added a commit to RethinkRobotics-opensource/ros_comm that referenced this issue May 25, 2016
rsinnet pushed a commit to MisoRobotics/ros_comm that referenced this issue Jun 19, 2017