Reconnect publisher-based logic can cause non-terminating leaked threads on bad subscribers #533
Comments
Thanks for the detailed report. Is there a reproducible way to trigger this behavior? I would like to be able to at least verify that a potential patch improves the behavior.
If I follow your description based on just reading the code, I would propose the following approach: in `tcpros_base.py`, `connect()` could mark the transport so that no reconnection is attempted, and it could do that right after the unknown exception has been caught. I have implemented that in a branch (`issue533`: https://github.com/ros/ros_comm/tree/issue533). Checking the transport's state in `robust_connect_subscriber()` then stops it from overwriting the 'no reconnection' decision; a sketch of that shape follows. Anyway, it would be nice to "confirm" that this actually fixes it...
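A minimal sketch of what such a guard could look like, assuming the fix takes the shape just described; using `conn.protocol is None` as the "transport was closed" signal is an assumption taken from the issue text below, not the actual branch diff:

```python
import time

class TransportInitError(Exception):
    """Stand-in for the rospy exception raised by connect()."""

def robust_connect_subscriber(conn, dest_addr, dest_port, pub_uri, receive_cb):
    """Sketch of the subscriber-side loop in tcpros_pubsub.py with the
    proposed guard; argument names are illustrative."""
    interval = 0.5
    while not conn.done:
        try:
            conn.connect(dest_addr, dest_port, pub_uri)
            conn.receive_loop(receive_cb)
        except TransportInitError:
            if conn.protocol is None:
                # connect() hit an unknown (FATAL) error and closed the
                # transport -- respect that instead of overwriting conn.done.
                break
            time.sleep(interval)     # transient failure: back off...
            interval = interval * 2  # ...and retry
```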
+1
+1, it's hard to know what the real consequences are with just the description of the issue and no tests, but it seems like a straightforward description of the issue and a matching change.
+0, what William said.
If nobody can come up with a reproducible example I will go ahead and merge the proposed fix (#538), since it is expected to address the described problem and I don't see a way it can be harmful.
Thanks for your fix and for looking at this. I've tested the fix (via latest indigo-devel) most of the morning and it appears to have resolved the bug. I will continue to test overnight, but I was able to see many of the same errors that previously triggered the "Unknown error initiating..." message, and none of them started the problematic infinite sleep loop, nor were they the direct cause of any leaked threads I could see. Checking for an (unknown-error) closed connection works. My only concern with the fix is the part reaching down and setting the transport's internal state (ros_comm/clients/rospy/src/rospy/impl/tcpros_base.py, lines 568 to 569 at ee6b648).
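To make that concern concrete, here is a hedged guess at the shape of the handler those lines sit in; it is reconstructed from the descriptions in this thread, not copied from commit ee6b648:

```python
class TransportInitError(Exception):
    """Stand-in for the rospy exception."""

class TCPROSTransportSketch:
    """Hypothetical fragment of TCPROSTransport (tcpros_base.py)."""

    def __init__(self):
        self.done = False
        self.socket = None
        self.protocol = object()

    def close(self):
        self.done = True      # no further reads or reconnects intended
        self.socket = None
        self.protocol = None  # children dropped, per the issue text

    def connect(self, open_socket):
        try:
            self.socket = open_socket()
        except Exception as e:
            # FATAL: no reconnection as error is unknown
            self.close()  # the transport mutates its own state here, which
                          # is the "reaching down" being questioned above
            raise TransportInitError(str(e))
```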
And yes, regarding the lack of a test script: unfortunately, while definitely reproducible, this bug is highly intermittent, involves connections between multiple entities, and seems to be best triggered by some internal code and scenarios that I've been doing my best to strip down to the bare basics... but alas, my closest efforts seem to raise more questions (and take significantly longer to trigger the bug) than they help toward simplifying the problem.
I'm having a similar issue. In my case it looks like a problem with rqt_reconfigure, though there was no user interaction when the error happened. Edit: it probably happened due to threading; I've put a lock around part of my node and it looks OK now (the sketch below shows the general pattern).
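A minimal sketch of the kind of lock that comment describes, assuming shared state touched by both a subscriber callback (which rospy invokes on its own thread) and the node's main loop; all names here are illustrative, not from the commenter's code:

```python
import threading

# Illustrative shared state touched by a subscriber callback and the
# node's main loop.
_state_lock = threading.Lock()
_latest_msg = None

def on_message(msg):
    """Subscriber callback: store the message under the lock."""
    global _latest_msg
    with _state_lock:
        _latest_msg = msg

def process_once():
    """Main-loop step: read the shared state under the same lock."""
    with _state_lock:
        msg = _latest_msg
    if msg is not None:
        pass  # ... do the actual work outside the lock ...
```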
The `robust_connect_subscriber()` function can get into a state where the local subscriber side of the connection becomes invalid in some way, but the publisher still exists, which leads to never-ending attempts to reconnect. The reconnect logic in `robust_connect_subscriber()` of `tcpros_pubsub.py` overwrites the reconnect flag based solely on the state of the target publisher. This can mask "FATAL" generic-`Exception` errors (the ones logged as "Unknown error") raised inside its call to `TCPROSTransport().connect()` in `tcpros_base.py`: those errors close the `TCPROSTransport` instance and, according to the exception handling there, should never trigger a reconnect ("`# FATAL: no reconnection as error is unknown`"). The result is a 'leaked' thread stuck in a never-ending loop: it always fails to connect, sleeps for an exponentially increasing amount of time, overrides the 'no reconnection' state, and tries to reconnect again. The thread is never cleaned up, even when the parent `Subscriber` that started it goes away.

The chain of events/calls goes like this (a condensed sketch follows the list):
1. In `tcpros_pubsub.py`, `TCPROSHandler().create_transport()` starts the thread with the target of `robust_connect_subscriber(...)` (~L250).
2. `robust_connect_subscriber()` starts a while loop conditioned on the `TCPROSTransport` instance (`conn`) being `not done`, and then makes a try/except-wrapped call to `connect()` (~L168).
3. In `tcpros_base.py`, if the `connect()` function throws a (standard) `Exception`-type exception, it is caught, logged as an "Unknown error", and the `TCPROSTransport` instance is closed (which sets the `done` variable to `True` and various children to `None`), with a comment indicating there should be no reconnection (~L563-569).
4. However, the `Exception` is turned into a `TransportInitError` before being re-raised, and in the exception handling for that `TransportInitError` the `conn.done` flag is overwritten based only on the state of the publisher, in this line of `robust_connect_subscriber()` in `tcpros_pubsub.py` (~L175): https://github.com/ros/ros_comm/blob/indigo-devel/clients/rospy/src/rospy/impl/tcpros_pubsub.py#L175
5. Every subsequent reconnect attempt then fails in the `read_header()` step of the `connect()` function when it tries to access a member of the now-`None` `protocol`, `self.protocol.buff_size` (~`tcpros_base.py` L618), printing an error each time.

Multiple instances of these loose reconnect threads can exist in the same Python process for the same topic. The initial error messages when the loop first begins were captured in four unique threads (note that there are plenty of other subscribers that worked fine; this is an intermittent problem).
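To tie the steps together, here is a condensed, self-contained model of the interaction described above; the classes are stand-ins for the real rospy types, and the line references in comments are the approximate locations cited in the list:

```python
import time

class TransportInitError(Exception):
    """Stand-in for rospy's TransportInitError."""

class FakeTransport:
    """Hypothetical stand-in for TCPROSTransport (tcpros_base.py)."""
    def __init__(self):
        self.done = False
        self.protocol = object()  # becomes None once close() runs

    def close(self):
        self.done = True      # the "no reconnection" state...
        self.protocol = None  # ...and children dropped

    def connect(self):
        if self.protocol is None:
            # ~L618: every retry now dies on self.protocol.buff_size
            raise TransportInitError("'NoneType' has no attribute 'buff_size'")
        try:
            raise OSError("simulated unexpected socket failure")
        except Exception as e:
            # ~L563-569: "# FATAL: no reconnection as error is unknown"
            self.close()
            raise TransportInitError(str(e))

def robust_connect_subscriber(conn, publisher_still_exists):
    """Condensed version of the loop in tcpros_pubsub.py."""
    interval = 0.5
    while not conn.done:
        try:
            conn.connect()
        except TransportInitError:
            # ~L175: the bug -- conn.done is overwritten based only on
            # whether the publisher still exists, undoing close()'s
            # "FATAL, do not reconnect" decision.
            conn.done = not publisher_still_exists
            time.sleep(interval)
            interval *= 2  # exponential back-off, forever

# With a live publisher this loop never terminates:
# robust_connect_subscriber(FakeTransport(), publisher_still_exists=True)
```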
These lines in `robust_connect_subscriber()` are also not the only place I've seen this try/sleep*2/try-again code (e.g. `_reconnect()` in `tcpros_base.py`); however, it is the only place I've seen `conn.done` being overwritten and logging that it is going to sleep, so I'm not sure whether any other sleeps are going on forever too.
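For contrast, the generic try/sleep*2/try-again shape mentioned here, written so the stop flag is only ever read, never re-armed; this is an illustration of the pattern, not the body of `_reconnect()`:

```python
import time

def retry_with_backoff(attempt, is_done, initial=0.5, cap=30.0):
    """Illustrative try/sleep*2/try-again loop. Unlike the buggy loop in
    robust_connect_subscriber(), is_done() is only read here, never
    overwritten, so a close() elsewhere terminates the loop."""
    interval = initial
    while not is_done():
        try:
            attempt()
            return True
        except IOError:
            time.sleep(interval)
            interval = min(interval * 2, cap)  # bounded exponential back-off
    return False
```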