
rosmaster leaves sockets in CLOSE_WAIT state #610

Closed
lrasinen opened this issue Apr 27, 2015 · 26 comments

@lrasinen
Contributor

Somehow related to #495, but not reproducible as easily.

In some scenarios, likely involving a node (or nodes) restarting, rosmaster can leave sockets in a CLOSE_WAIT state, eventually exhausting the limit on open file descriptors and becoming unresponsive.

Our system has about 30 nodes and 150 topics, and when the system is run without restarting, there is a steady upward creep in the number of CLOSE_WAIT sockets in our system-wide monitoring. Detailed examination assigns most of these to the rosmaster process. See picture.

[image]

The sudden jump in the image has not been fully explained, but it's likely associated with a hardware problem that caused parts of the system to restart repeatedly. Still, the trend is obvious.

After diagnosing the problem we've been trying to reproduce it with a simpler setup and/or collect logs, and the results are inconclusive. After comparing the rosmaster logs with lsof output, the leaks do appear to be related to ServerProxy objects.

It does not seem to matter whether the node uses roscpp or rospy, or whether it's publisher-only or subscriber-only. There are even instances of CLOSE_WAIT sockets associated with nodes that do not publish or subscribe to anything (they just read some parameters on startup).

We've run a belt-and-suspenders approach since hitting the limit, which involves a) nightly restarts and b) keeping the ServerProxy cache trimmed (patch below).

The nightly restarts have kept us from hitting the limit, but we're planning to run without them for a while to see if the ServerProxy patch helps. The default daily cycle already looks different:

[image]

The spikes are due to rosbag restarting every hour so that we keep the bags at a manageable size.

When there is a leak, it seems to be contained:

[image]

(These are system-wide numbers; rosmaster has 90 CLOSE_WAIT sockets.)

--- util.py.old 2015-04-22 12:23:11.054085143 +0300
+++ util.py.new 2015-04-22 12:23:18.418073585 +0300
@@ -45,7 +45,12 @@
 except ImportError:
     from xmlrpclib import ServerProxy

-_proxies = {} #cache ServerProxys
+import collections
+import threading
+_proxies = collections.OrderedDict()
+_lock = threading.Lock()
+N = 100
+
 def xmlrpcapi(uri):
     """
     @return: instance for calling remote server or None if not a valid URI
@@ -56,11 +61,23 @@
     uriValidate = urlparse(uri)
     if not uriValidate[0] or not uriValidate[1]:
         return None
-    if not uri in _proxies:
-        _proxies[uri] = ServerProxy(uri)
+    # Experimental bit
+    with _lock:
+        proxy = _proxies.get(uri, None)
+        if proxy is None:
+            proxy = ServerProxy(uri)
+        else:
+            # OrderedDict requires a deletion to update order
+            del _proxies[uri]
+        _proxies[uri] = proxy
+        # Trim to size
+        while len(_proxies) > N:
+            _proxies.popitem(last=False)
+
     return _proxies[uri]


 def remove_server_proxy(uri):
-    if uri in _proxies:
-        del _proxies[uri]
+    with _lock:
+        if uri in _proxies:
+            del _proxies[uri]
@dirk-thomas
Member

Please try to provide a reproducible example. Without that I simply can't change core logic like this. Such code changes need to be "validated" to show that they actually address the problem.

Also, if you propose code changes, do not paste the diff in a ticket. Please use a PR for that instead. But your current patch would likely not be accepted - hard-coding a fixed upper limit of 100 proxies looks very bogus to me.

@garaemon

garaemon commented May 5, 2015

I have a similar problem, described in #325.

@lrasinen
Contributor Author

lrasinen commented May 5, 2015

@dirk-thomas I understand the situation you're in. So far I've spent about a week trying to reproduce this reliably under laboratory conditions, and all I've managed to do is crash my computer twice by launching thousands of instances by accident :)

Having said that, I think the data does point to a leak in ServerProxy caching. The diff in question wasn't intended for merging as such, it was more for "this is what we tried and it helps for us". For example, the value N=100 was selected after careful analysis by pulling a nice round number out of a hat.

My pet theory for the root cause is that items in _proxies only get dropped when an API gets bumped. So a node comes up, rosmaster sends it some sort of update, and then the node shuts down, leaving the ServerProxy alive and the socket in CLOSE_WAIT state... and since the master never needs to send any more updates, it never notices that the other end has gone cold.

Thus, having a limited number of proxies helps; taken to the limit, disabling caching entirely helps too. For example, there's another xmlrpcapi() function in rospy/core.py which doesn't do any caching.
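
For illustration, a cache-free variant would look roughly like this (a sketch in the spirit of the rospy/core.py function, not its exact source); every call builds a fresh ServerProxy, so nothing outlives the request that needed it:

try:
    from xmlrpc.client import ServerProxy   # Python 3
except ImportError:
    from xmlrpclib import ServerProxy       # Python 2
try:
    from urllib.parse import urlparse
except ImportError:
    from urlparse import urlparse

def xmlrpcapi_uncached(uri):
    """Return a one-shot XML-RPC proxy for uri, or None if uri is not valid."""
    if uri is None:
        return None
    parsed = urlparse(uri)
    if not parsed[0] or not parsed[1]:
        return None
    # no module-level cache: the proxy (and its socket) can be garbage
    # collected as soon as the caller drops it
    return ServerProxy(uri)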

I'll keep up my attempts to get a lab-reproducible set of nodes; I'm open to any suggestions on how to tease out the bug more easily.

@aginika

aginika commented May 5, 2015

We have a similar problem on Baxter too...

@rethink-imcmahon

@aginika please let me know the symptoms of your issue on Baxter

@lrasinen
Contributor Author

lrasinen commented May 8, 2015

I did some experiments, and it's easy to show that ServerProxy objects pile up. I ran these two loops:

$ while rostopic pub -1 foo std_msgs/String "hi there"; do :; done
$ while : ; do rostopic echo -n 1 foo; done

combined with a debug log entry in util.py that prints out the size of the _proxies dict on every call to xmlrpcapi(), and lo and behold, it does creep up every 3 seconds. However, this doesn't yet reproduce the CLOSE_WAIT situation. I'll try to devise a nastier test later today.

@lrasinen
Contributor Author

lrasinen commented May 8, 2015

Contrary to what I've said earlier, rospy vs roscpp might actually be significant. The Python HTTP server responds with HTTP/1.0, so there's no keep-alive and the XML-RPC client in rosmaster knows this.

I'll write a simple C++ node when I next have time to dedicate to this issue. Might be a week or two, though, so I don't mind if someone beats me to it :)

@dirk-thomas
Member

@lrasinen I can't reproduce a problem with your two-line command line invocations of rostopic. I also don't think that rospy is affected at all, and the command line tools use rospy. Are you using the latest released packages of ROS on Indigo? Can you please post your exact changes (preferably in a ready-to-run fork) which report the increasing proxy counts?

Regarding the increasing number of sockets in CLOSE_WAIT state I think I found an easy way to reproduce it. I did the following:

When doing the same with the rospy_tutorials talker / listener launch file the problem does not exist for me.

@lrasinen
Contributor Author

Gah, sorry for omitting the version details earlier. Anyway, this is on Ubuntu Trusty and the rosmaster package is "1.11.10-0trusty-20141229-2145-+0000". As far as I can tell, that's the latest. If not, I'll have a chat with our IT...

My exact change was this one-liner:

--- util.py.orig    2015-05-08 11:12:22.926255672 +0300
+++ util.py 2015-05-08 14:50:23.494047619 +0300
@@ -58,6 +58,7 @@
         return None
     if not uri in _proxies:
         _proxies[uri] = ServerProxy(uri)
+    print "proxies", len(_proxies), "uri", uri
     return _proxies[uri]

(Also at https://github.com/lrasinen/ros_comm/tree/proxy-debug for merging convenience; but that code just has the patch applied to it and has not been tested as such.)

You're right, the shell examples don't exhibit the CLOSE_WAIT problem, but with the above change you can see the slow growth in the number of proxies. That's what led me to reconsider the difference between roscpp and rospy.

I had a look at the talker / listener launch and that looks very similar to the conditions we're experiencing (where the role of the Ctrl-C is taken by a camera node crashing). I'm willing to say it's a good reproduction of the problem.

@dirk-thomas
Member

I don't think that the increasing number of cached proxies is the primary problem. The number also increases when running the rospy_tutorials talker_listener launch file but that does not leave any sockets in CLOSE_WAIT state.

@dirk-thomas
Member

While looking into the problem a bit more I found the following weird behavior:

  • when running the C++ talker_listener repeatedly against a roscore from Debian packages it shows the increasing number of sockets in CLOSE_WAIT as mentioned before
  • when running it against a roscore from a devel space (where all ROS core packages are in the workspace) the socket count stays at zero
  • when running it against a roscore from an install space it behaves the same as from Debian packages

In all cases the launch file was started from the same environment (devel space) and the sources in the workspace had the same version as the Debian packages.

@lrasinen
Contributor Author

My theory about the role of server proxies goes like this.

  1. client uses rospy
    The client responds to the HTTP request with HTTP/1.0, causing httplib to close the connection immediately. The socket is closed, and the leaking proxies only leak their minuscule bit of memory (but since the connection is closed, there's not much benefit from caching either).

  2. client uses roscpp
    The client responds with HTTP/1.1, causing httplib to leave the connection open in case it's used again. When the client goes away, the socket stays behind (see the sketch after this list).
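
The difference can be seen with a self-contained sketch (plain Python, not ros_comm code; the server and handler names here are made up for illustration):

try:                                     # Python 2 module names, as used elsewhere in this thread
    from SimpleXMLRPCServer import SimpleXMLRPCServer, SimpleXMLRPCRequestHandler
    from xmlrpclib import ServerProxy
except ImportError:                      # Python 3 equivalents
    from xmlrpc.server import SimpleXMLRPCServer, SimpleXMLRPCRequestHandler
    from xmlrpc.client import ServerProxy
import threading

class KeepAliveHandler(SimpleXMLRPCRequestHandler):
    protocol_version = 'HTTP/1.1'        # change to 'HTTP/1.0' to get close-per-request behaviour

server = SimpleXMLRPCServer(('127.0.0.1', 0), requestHandler=KeepAliveHandler,
                            logRequests=False)
server.register_function(lambda x: x, 'echo')
threading.Thread(target=server.serve_forever).start()

host, port = server.server_address
client = ServerProxy('http://%s:%d/' % (host, port))
client.echo(1)
client.echo(2)   # with HTTP/1.1 this reuses the same TCP connection
# If the server process died at this point, the cached connection on the client
# side would linger in CLOSE_WAIT until the ServerProxy (or its transport) is
# closed or garbage collected.
client('close')()    # explicitly drop the cached connection
server.shutdown()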

Anyway, to test this I wrote even more debugging at https://github.com/lrasinen/ros_comm/tree/moalv (Mother of all layering violations)

My afternoon schedule looks busy so I can't test the roscpp part of the above theory today, I'll try to get back to this later this week though.

@lrasinen
Contributor Author

I was able to get a few CLOSE_WAIT sockets by running 100 copies of rosrun roscpp_tutorials anonymous_listener in one shell, and rostopic pub -1 chatter std_msgs/String "hi there" in another.

In the ServerProxy debug output we can see file descriptor 7:
uri http://tardis:49228/, fd 7

and its counterpart in lsof -ni listing:
rosmaster 15908 lrasinen 7u IPv4 35406754 0t0 TCP 127.0.0.1:56972->127.0.1.1:49228 (CLOSE_WAIT)

But most of the proxies (currently about 400) don't have any attached sockets so there might be some other factor still at play. Writing to a closed socket will eventually fail and close the connection, so perhaps there's some list that gets pruned before that occurs.
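
As an aside, a few lines of Python make it easier to watch the CLOSE_WAIT count for a process than grepping lsof output (this assumes the psutil library is available; it is purely a diagnostic, not ros_comm code):

import psutil

def close_wait_count(pid):
    """Count CLOSE_WAIT sockets held by the given process."""
    conns = psutil.Process(pid).connections(kind='inet')
    return sum(1 for c in conns if c.status == psutil.CONN_CLOSE_WAIT)

# e.g. for the rosmaster pid from the lsof line above
print(close_wait_count(15908))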

@dirk-thomas
Member

I disabled the keep-alive functionality for the xml rpcs (…) but the problem is still the same.

@mgrrx
Contributor

mgrrx commented May 30, 2016

We observed the same problem on our robots. I finally debugged it down to a single line of code that has to be added in rosgraph (mgrrx@4f06033):

diff --git a/tools/rosgraph/src/rosgraph/xmlrpc.py b/tools/rosgraph/src/rosgraph/xmlrpc.py
index 7d9aad8..a9c3d52 100644
--- a/tools/rosgraph/src/rosgraph/xmlrpc.py
+++ b/tools/rosgraph/src/rosgraph/xmlrpc.py
@@ -76,6 +76,8 @@ def isstring(s):
         return isinstance(s, str)

 class SilenceableXMLRPCRequestHandler(SimpleXMLRPCRequestHandler):
+    protocol_version = 'HTTP/1.1'
+
     def log_message(self, format, *args):
         if 0:
             SimpleXMLRPCRequestHandler.log_message(self, format, *args)

Caching the ServerProxies in the rosmaster only makes sense if HTTP/1.1 is used. With HTTP/1.0, a connection to the rosmaster remains in CLOSE_WAIT until the entry in the _proxies dictionary gets deleted or the socket is closed explicitly, e.g. by calling _proxies[uri]._ServerProxy__close() after a timeout or something similar.
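
As an illustration of that idea (a sketch only, not the actual ros_comm patch; _lock and _proxies refer to the module-level cache in rosmaster's util.py as shown earlier in this thread), remove_server_proxy could close the transport when it evicts an entry:

def remove_server_proxy(uri):
    with _lock:
        proxy = _proxies.pop(uri, None)
    if proxy is not None:
        try:
            # name-mangled private method on xmlrpclib.ServerProxy; it closes
            # the cached connection so the file descriptor is actually released
            proxy._ServerProxy__close()
        except Exception:
            pass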

Update: Unfortunately this fix didn't solve all the problems we observed. I have now replaced the httplib backend of xmlrpclib with python-requests. In combination with the change above, I could reduce the number of CLOSE_WAIT connections to 1 on our system (~80 nodes). Previously, we had more than 100 CLOSE_WAIT connections when starting the robot.

Can you please verify whether that fixes the issue? I'd like to open a pull request as soon as possible.

mgrrx added commits to mgrrx/ros_comm that referenced this issue May 31, 2016
@dirk-thomas
Member

Can you please clarify what exact change you are proposing?

@mgrrx
Contributor

mgrrx commented May 31, 2016

My first attempt was to make sure HTTP/1.1 is used for every connection. mgrrx/ros_comm@4f06033 fixed this and eliminated all CLOSE_WAITs in my dummy world (playing around with talker/listener nodes from the tutorial packages). However, after testing on the real robot, I found out that it is still not working and connections remain in the CLOSE_WAIT state.

After a bit of googling and playing around with xmlrpclib and the underlying httplib, I figured out that:

  • httplib's HTTPConnection is not thread-safe
  • the HTTP/1.1 implementation is not that clear and people started writing httplib2 and python-requests to replace this library.

If I understand the httplib code correctly, connections are only closed if the header "connection: close" is sent. Even though a client might close the connection, the socket then remains in the CLOSE_WAIT state.

I found some other projects that are using the requests library as the HTTP backend for xmlrpclib. I implemented that in rosgraph and updated the other packages that use xmlrpclib (mgrrx@f1f3340). The subsequent commits update the tests and docstrings.

So my proposed solution is to use the requests library for http connections and to make sure HTTP/1.1 is used everywhere.
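
In rough terms, the requests-based backend amounts to a custom Transport for xmlrpclib along these lines (a simplified sketch, not the actual commit; HTTPS support and error handling are omitted):

try:
    import xmlrpclib                     # Python 2
except ImportError:
    import xmlrpc.client as xmlrpclib    # Python 3
import requests

class RequestsTransport(xmlrpclib.Transport):
    """Send XML-RPC requests through a requests.Session (keep-alive, pooling)."""

    def __init__(self):
        xmlrpclib.Transport.__init__(self)
        self._session = requests.Session()

    def request(self, host, handler, request_body, verbose=0):
        response = self._session.post(
            'http://%s%s' % (host, handler),
            data=request_body,
            headers={'Content-Type': 'text/xml'})
        response.raise_for_status()
        # reuse the stock xmlrpclib parser to turn the XML body into Python values
        parser, unmarshaller = self.getparser()
        parser.feed(response.content)
        parser.close()
        return unmarshaller.close()

# usage: proxy = xmlrpclib.ServerProxy('http://localhost:11311/', transport=RequestsTransport())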

@dirk-thomas
Member

Please see #371 for a related patch. If you could provide PRs for these changes which pass all the cases that were blocking the previous PR, that would be great. But such a change will require extensive testing and might only be considered for future ROS releases (maybe still for Kinetic, but I am not sure about that).

@mgrrx
Contributor

mgrrx commented Jun 3, 2016

My implementation passes the test cases mentioned in #371 without blocking any terminal. I totally agree with you that this needs extensive testing, but I'll keep testing the modified core on our robots anyway before opening the PR.
It would be helpful if anyone could test the patches on Kinetic, as we are still using Indigo…

mgrrx added commits to mgrrx/ros_comm that referenced this issue on Jul 11, Jul 13, and Jul 14, 2016
@dirk-thomas
Member

The referenced PRs are currently being actively tested. As soon as we have made sure that the changes don't introduce regressions, they will be merged and released into Kinetic. Afterwards they will be considered for backporting to Jade / Indigo.

You could try the patches from the referenced PRs on your robot and report back in them about your experience. It always helps to know whether a patch does or doesn't work for more users.

@alecive

alecive commented Jan 27, 2017

Thank you @dirk-thomas for the answer. So, to make it clear, what you suggest is to compile ros_comm from https://github.com/mgrrx/ros_comm/tree/fix/610 and see if it works for us, am I right?

On our Baxter machine we have ROS Indigo installed system-wide from the Debian packages (yes, I don't like this either, but I inherited the robot from other people), so we would need to do the following:

  • Uninstall all the system packages
  • Download and compile ROS from source using this tutorial: http://wiki.ros.org/indigo/Installation/Source
  • Look for the ros_comm package and replace it with this version: https://github.com/mgrrx/ros_comm/tree/fix/610
  • Recompile ROS to link against the new version of ros_comm
  • Recompile the Baxter code to link against the new version of ros_comm
  • Recompile our code to link against the new version of ros_comm
  • Cross fingers, and test

Doing that is not a problem, I just wanted to be sure that we take all the right steps in order to avoid further problems down the road.

@alecive

alecive commented Jan 27, 2017

Plus, @rethink-imcmahon correct me if I'm wrong, but we do not have any way to manually recompile the patched version of ROS on the Baxter machine. Am I right?

@dirk-thomas
Member

Since the PR only touches Python code you don't need to compile anything. Simply clone the branch into a workspace (or create an empty one), build the workspace, and source the result. That will put all the Python code on the PYTHONPATH in front of the system packages.

jspricke pushed commits to mgrrx/ros_comm that referenced this issue Jan 31, 2017
dirk-thomas pushed a commit that referenced this issue Feb 8, 2017
* rospy: switched from poll to epoll for tcpros. Fixes #107 for me and is related to #610 as connections remained in CLOSE_WAIT.
Added a check for CLOSE_WAIT connections in the RegManager loop

* using hasattr instead of dir

* Tests for #107, currently failing of course

* Simplified the code for selecting the Poller backend.
Added a fallback to select.poll if select.epoll is not available in
Linux. OS X will still use kqueue and Windows the noop operator.

* Removed copyrights from tests
Removed whitespace

* Rewrote the check for closed connections. The crucial part is now guarded but the lock is only acquired when connection list needs to be modified. This avoids unnecessary locking.

* Remove copyright
mikepurvis added a commit to clearpathrobotics/ros_comm that referenced this issue Feb 10, 2017
efernandez pushed a commit to clearpathrobotics/ros_comm that referenced this issue Feb 23, 2017
otamachan added a commit to otamachan/ros_comm that referenced this issue Jun 13, 2017
* Check CLOSE_WAIT sockets and close them whenever call xmlrpcapi
otamachan added a commit to otamachan/ros_comm that referenced this issue Jun 13, 2017
* Check and close CLOSE_WAIT socket whenever xmlrpcapi is called
@dirk-thomas
Member

With #1104 being merged, can you please double-check that all cases from this ticket are covered and that this can be closed?

@dirk-thomas
Member

Assuming that the referenced PR addressed the problem I will close this. Please feel free to comment and this can be reopened.
