rospy: Connection / Memory Leak, adding conn. to closed _SubscriberImpl #544
DISCLAIMER: The internal use case where this series of 3 leak issues arose involves RobotWebTools/rosbridge_suite as the rospy node doing the subscribing and generating the leaked connections, which is why I don't have a good test script for the issue.
FYI: some ADDITIONAL debug tools are noted under "Identifying the Bug State" below.
Oh, this also did not (noticeably) happen before we upgraded everything from 32-bit Groovy to 64-bit Indigo, but that could be another red herring, as we have seen many process loads increase after the conversion, so the race condition might just have become more likely to present itself now that the system is more heavily loaded.
Thanks @rethink-rlinsalata for this detailed bug report, we'll see what we can do about it.
Protect against race condition where a new `TCPROSTransport` connection is created and added to a `_SubscriberImpl` that is already closed. Checks and returns False if closed so transport creator can shutdown socket connection. Fixes issue ros#544.
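In sketch form, the guard described there looks roughly like the following, assuming a simplified `_SubscriberImpl` with a `closed` flag and a connection lock; `SubscriberImplGuarded` and its attributes are illustrative, not the actual patched rospy code:

```python
# Minimal sketch of the guard described above (not the merged patch verbatim):
# refuse a new connection when the impl has already been closed.
import threading

class SubscriberImplGuarded(object):
    def __init__(self):
        self.closed = False
        self.connections = []
        self.c_lock = threading.Lock()   # guards connections[] and closed

    def add_connection(self, c):
        """Register a new transport; refuse it if this impl is already closed."""
        with self.c_lock:
            if self.closed:
                # Returning False tells the transport creator the connection
                # was rejected, so it can shut the socket down instead of
                # leaking a live receive thread.
                return False
            self.connections.append(c)
            return True
```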
I have finally read through the code in more detail. I could imagine the following order of events:
It would also explain why it only rarely happens: usually the thread will not start immediately after being invoked. If my guess is right, I think the best approach is to delay starting the thread until after the connection has been added (#603). Could you please try whether this makes it work in your complex use case?
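A minimal sketch of that reordering, assuming a simplified `create_transport()`; the real function in `tcpros_pubsub.py` does considerably more (header exchange, retries, error handling), and the names here mirror the objects discussed in this issue rather than the exact rospy API:

```python
# Sketch of the reordering proposed in #603: register the connection with the
# subscriber impl *before* starting the receive thread, so the thread can
# never be left running against an impl that was closed in between.
import threading

def create_transport_sketch(sub, conn):
    conn.connect()                    # establish the TCP socket (simplified)
    if not sub.add_connection(conn):  # False means <sub> is already closed
        conn.close()                  # reject: shut the socket down, no leak
        return False
    # Only now spin up the receive loop; the connection is already tracked
    # and will be torn down by <sub>.close() like any other.
    t = threading.Thread(target=conn.receive_loop,
                         args=(sub.receive_callback,))
    t.daemon = True
    t.start()
    return True
```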
Should be addressed by #603. I moved forward and merged the patch to include it in the upcoming patch release; otherwise it would likely take a long time until the next patch release. @rethink-rlinsalata Please comment if you had a chance to test the change.
I am hitting this:

Using ROS Kinetic.
@krixkrix Please post more details.
Hmmm, turns out to be harder to reproduce today than I imagined. The error appeared when having a publisher and a subscriber in two different threads in the same node, accessing the same topic. The publisher would disconnect while the subscriber was connected and still waiting for its first message.
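For anyone else trying to reproduce it, a minimal script along those lines could look like the following. This is a hypothetical sketch, and as noted above the race is timing-dependent, so it will not reliably trigger the error:

```python
#!/usr/bin/env python
# Hypothetical repro attempt: a publisher and a subscriber on the same topic
# in two threads of one node, churning connections to widen the race window.
import threading

import rospy
from std_msgs.msg import String

def publish_loop():
    while not rospy.is_shutdown():
        pub = rospy.Publisher('/chatter', String, queue_size=1)
        rospy.sleep(0.05)
        pub.publish(String(data='ping'))
        pub.unregister()   # disconnect while the subscriber may still be connecting

def subscribe_loop():
    while not rospy.is_shutdown():
        sub = rospy.Subscriber('/chatter', String, lambda msg: None)
        rospy.sleep(0.05)
        sub.unregister()   # unsubscribe rapidly, as in the original barrage scenario

if __name__ == '__main__':
    rospy.init_node('leak_race_test')
    for target in (publish_loop, subscribe_loop):
        t = threading.Thread(target=target)
        t.daemon = True
        t.start()
    rospy.spin()
```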
This is the third in a series of three memory/thread leak issues (#520, #533) I've been investigating. They mostly arise from race conditions during intensive barrages of rapidly creating new subscriptions and unsubscribing in a rospy node, on topics coming from an already heavily-loaded (robot) node.
A problem can occur where a subscriber/connection is "leaked": the objects that run the connection are orphaned and never cleaned up, and the socket connection is still held open by the leaked thread, which continues to run the TCP receive/callback loop even though the originating "Subscriber" is "closed" or gone. Furthermore, new Subscribers to the same Topic and Publisher will not re-use those leaked underlying connections like they normally would (since they are orphaned/leaked). This leads to ever-increasing process memory in the subscribing rospy process _AND_ in the process of the Publisher node on the other end of the TCP socket connection (even a roscpp-based node) -- until one or both are killed by the kernel oom_killer.
The more technical description: the two objects in charge of the connection are the underlying, per-topic `rospy.topics._SubscriberImpl` and the TCP connection object `rospy.impl.tcpros_base.TCPROSTransport`. A major part of the problem seems to be that somehow (race condition?) a new `TCPROSTransport` connection object is being created and added to the `<_SubImpl>.connections[]` array of a `_SubImpl` that has already been closed (`_SubImpl.closed == True`).

For the most direct evidence, we can see the bad/bug state by adding two debug log checks for the `_SubImpl.closed` flag in the `_SubImpl.add_connection()` and `_SubImpl.receive_callback()` functions, as seen in the diffs here:

`_SubscriberImpl.add_connection()`:
`_SubscriberImpl.receive_callback()`:
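In sketch form, the two checks amount to the following (assuming the Indigo-era `rospy/topics.py` layout; the stand-in class name and log wording are illustrative, not the exact diff):

```python
# Sketch of the two debug checks, as they would sit inside
# rospy.topics._SubscriberImpl (stand-in class shown here):
import rospy

class _SubscriberImplDebug(object):
    def add_connection(self, c):
        if self.closed:
            rospy.logdebug("BUG: add_connection() on closed _SubscriberImpl "
                           "for topic %s", self.resolved_name)
        # ... original add_connection() body continues here ...

    def receive_callback(self, msgs):
        if self.closed:
            rospy.logdebug("BUG: receive_callback() on closed _SubscriberImpl "
                           "for topic %s", self.resolved_name)
        # ... original receive_callback() body continues here ...
```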
Below is a more detailed walkthrough, but one of the major mysteries is still why and when the `_SubImpl` is getting closed, and from where and how it happens in a way that allows this new connection to be added.

Details:
One of the relevant sections of code is the `create_transport()` function [A] in `tcpros_pubsub.py` (`ros_comm/clients/rospy/src/rospy/impl/tcpros_pubsub.py`, line 246 at ee6b648):

- A `_SubImpl` instance (I'll call it `<sub>`) is captured near the top from the topic manager.
- A `TCPROSTransport` object is created (I'll call it `<conn>`), and a separate thread is spun up to establish the TCP socket connection, after which, at some point, that thread will continue off on its own running the TCP msg `<conn>.receive_loop()` [B]. This connection thread was passed a reference to the `<sub>.receive_callback()` function, which is called by `<conn>.receive_loop()` and is one of two references to the `_SubImpl` object being held on to by the `<conn>` object.
- Back in the main thread, after spawning off the `<conn>` thread, the function continues on and adds the `<conn>` object to the `_SubImpl` object's `<sub>.connections[]` array, with a call to `<sub>.add_connection(<conn>)` [C].
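To make the ordering concrete, here is a self-contained toy reconstruction of that flow; `SubStub` and `ConnStub` are stand-ins for the rospy objects (not the real implementation), and the bracketed labels match the steps above:

```python
# Toy reconstruction of the ordering in create_transport() (tcpros_pubsub.py).
import threading
import time

class ConnStub(object):                  # stands in for TCPROSTransport
    def receive_loop(self, callback):
        while True:                      # exit conditions elided; see the
            callback(["msg"])            # receive_loop sketch further below
            time.sleep(0.1)

class SubStub(object):                   # stands in for _SubscriberImpl
    def __init__(self):
        self.closed = False
        self.connections = []
        self.callbacks = []
    def receive_callback(self, msgs):
        for cb in self.callbacks:        # empty after close(): silent no-op
            cb(msgs)
    def add_connection(self, c):
        self.connections.append(c)       # note: never consults self.closed

sub = SubStub()                          # [A] <sub> captured from the topic manager
conn = ConnStub()                        # <conn> created
t = threading.Thread(target=conn.receive_loop,      # [B] the receive loop runs
                     args=(sub.receive_callback,))  #     off on its own, holding
t.daemon = True                                     #     references to <sub>
t.start()
# <-- RACE WINDOW: nothing stops <sub> from being closed right here
sub.add_connection(conn)                 # [C] <conn> added to a possibly-dead <sub>
```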
Now at some point while all this is happening, the `_SubImpl` object is closed (or potentially anytime after that first reference to `<sub>` is obtained, but before the `<sub>.add_connection(<conn>)` method is executed). _I could not figure out why or how this happens_, but I'm fairly sure about the timing. This means that the `_SubImpl` instance has no parent `Subscriber` (it was deleted), and the `TopicManager` no longer has a reference to it. The `<sub>.connections[]` array and the `<sub>.callbacks[]` array have both been cleared out (`len 0`), and the `<sub>.closed = True` flag has been set.
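What that closed state amounts to, extending the `SubStub` toy from the sketch above (modeled on the close() behavior described here, simplified rather than copied from `rospy/topics.py`):

```python
class SubStubClosable(SubStub):          # extends the toy above
    def close(self):
        self.closed = True               # the flag add_connection() never re-checks
        del self.connections[:]          # connections[] cleared -> len 0
        del self.callbacks[:]            # callbacks[] cleared -> receive_callback
                                         # has nothing to iterate over
```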
However, after that first reference to `<sub>` was obtained at the top, at no other time does the `<conn>` nor the rest of the connection code check or care about the state of the `_SubImpl` object. There also appears to be nothing locking out the connection creation+callback+addition process and the `_SubImpl` closing process from interfering with each other. This means that when the `<sub>.add_connection(<conn>)` function is run, the perfectly happy `<conn>` is added to a `_SubImpl` object that is supposed to be `closed = True` but never checks that state, so you end up with a connections array of length 1: `<sub>.connections[<conn>]`. And when the TCP callback loop thread calls the `<sub>.receive_callback()` function, it just returns, since there are zero `<sub>.callbacks[]` to iterate over; neither the loop nor the callback checks the state of the `_SubImpl`. The loop only checks the state of the socket, ros, and `<conn>.done` (which is `False`).
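In sketch form, those exit checks look roughly like this (a paraphrase of the shape of `TCPROSTransport.receive_loop()`, with the blocking socket read elided; names are simplified):

```python
import rospy

class TransportSketch(object):           # paraphrase of TCPROSTransport's loop
    done = False
    socket = None

    def receive_once(self):
        raise NotImplementedError        # the blocking TCP socket read, elided

    def receive_loop(self, msgs_callback):
        # Only <conn>.done, ros shutdown, and the socket are checked; the
        # owning _SubscriberImpl's closed flag is never consulted, so a
        # leaked thread keeps looping against a dead subscriber.
        while not self.done and not rospy.is_shutdown():
            if self.socket is None:
                break
            msgs = self.receive_once()
            msgs_callback(msgs)          # -> <sub>.receive_callback(), which
                                         #    returns immediately (no callbacks)
```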
You can see a sample of the two objects from one of the leaked threads during a test run here (obtained with dowser):

- `_SubscriberImpl` (`<sub>`): _SubImpl-dowser
- `TCPROSTransport` (`<conn>`): TCPROSTransport-dowser

Identifying the Bug State
Note that `rosnode info` (and the like) will not show any information about the leaked topic connections, as they think they are closed. However, the leaked connections themselves can be seen:

- among the node's open sockets (e.g. `$ netstat --numeric-ports -peev | grep <rospy subscriber node's PID>`)
- with `ps` on the command line, by looking at the rospy node's Light-Weight Processes (LWPs, or threads) with timestamps that are "old" and don't go away (during Subscriber creation and unsubscription) (e.g. `$ ps -L -p <rospy PID> -o lwp,lstart,stat,wchan`)
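In the same spirit, the leak can also be watched from inside the process with a quick thread census (a heuristic sketch using only the Python standard library):

```python
# Heuristic in-process check: leaked receive loops show up as live threads
# that never go away after unsubscribing. Nothing rospy-specific is assumed.
import threading
import time

def dump_thread_census():
    threads = threading.enumerate()
    print("%d live threads at %s" % (len(threads), time.strftime("%H:%M:%S")))
    for t in threads:
        print("  %s (daemon=%s)" % (t.name, t.daemon))
```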