
rosmaster leaves sockets in CLOSE_WAIT state #610

Closed
lrasinen opened this issue Apr 27, 2015 · 26 comments

@lrasinen
Contributor

Somehow related to #495, but not reproducible as easily.

In some scenarios, likely involving a node (or nodes) restarting, rosmaster can leave sockets in a CLOSE_WAIT state, eventually exhausting the limit on open file descriptors and becoming unresponsive.

Our system has about 30 nodes and 150 topics, and when the system is run without restarting, there is a steady upward creep in the number of CLOSE_WAIT sockets in our system-wide monitoring. Detailed examination assigns most of these to the rosmaster process. See picture.

[image]

The sudden jump in the image has not been fully explained, but it's likely associated with a hardware problem that caused parts of the system to restart repeatedly. Still, the trend is obvious.

After diagnosing the problem we've been trying to reproduce it with a simpler setup and/or collect logs, and the results are inconclusive. After comparing the rosmaster logs with lsof output, the leaks do appear to be related to ServerProxy objects.

It does not seem to matter whether the node uses roscpp or rospy, or whether it's publisher-only or subscriber-only. There are even instances of CLOSE_WAIT sockets associated with nodes that do not publish or subscribe to anything (they just read some parameters on startup).

We've run a belt-and-suspenders approach since hitting the limit, which involves a) nightly restarts and b) keeping the ServerProxy cache trimmed (patch below).

The nightly restarts have kept us from hitting the limit, but we're planning to run without them for a while to see if the ServerProxy patch helps. The default daily cycle already looks different:

[image]

The spikes are due to rosbag restarting every hour so that we keep the bags at a manageable size.

When there is a leak, it seems to be contained:

[image]

(These are system-wide numbers; rosmaster has 90 CLOSE_WAIT sockets.)

--- util.py.old 2015-04-22 12:23:11.054085143 +0300
+++ util.py.new 2015-04-22 12:23:18.418073585 +0300
@@ -45,7 +45,12 @@
 except ImportError:
     from xmlrpclib import ServerProxy

-_proxies = {} #cache ServerProxys
+import collections
+import threading
+_proxies = collections.OrderedDict()
+_lock = threading.Lock()
+N = 100
+
 def xmlrpcapi(uri):
     """
     @return: instance for calling remote server or None if not a valid URI
@@ -56,11 +61,23 @@
     uriValidate = urlparse(uri)
     if not uriValidate[0] or not uriValidate[1]:
         return None
-    if not uri in _proxies:
-        _proxies[uri] = ServerProxy(uri)
+    # Experimental bit
+    with _lock:
+        proxy = _proxies.get(uri, None)
+        if proxy is None:
+            proxy = ServerProxy(uri)
+        else:
+            # OrderedDict requires a deletion to update order
+            del _proxies[uri]
+        _proxies[uri] = proxy
+        # Trim to size
+        while len(_proxies) > N:
+            _proxies.popitem(last=False)
+
     return _proxies[uri]


 def remove_server_proxy(uri):
-    if uri in _proxies:
-        del _proxies[uri]
+    with _lock:
+        if uri in _proxies:
+            del _proxies[uri]
@dirk-thomas
Member

Please try to provide a reproducible example. Without that I simply can't change core logic like this. Such code changes need to be "validated" to show that they actually address the problem.

Also, if you propose code changes, do not paste the diff in a ticket. Please use a PR for that instead. But your current patch would likely not be accepted - hard-coding a fixed upper limit of 100 proxies looks very bogus to me.

@garaemon

garaemon commented May 5, 2015

I have a similar problem, described in #325.

@lrasinen
Contributor Author

lrasinen commented May 5, 2015

@dirk-thomas I understand the situation you're in. So far I've spent about a week trying to reproduce this reliably under laboratory conditions, and all I've managed to do is crash my computer twice by launching thousands of instances by accident :)

Having said that, I think the data does point to a leak in ServerProxy caching. The diff in question wasn't intended for merging as such, it was more for "this is what we tried and it helps for us". For example, the value N=100 was selected after careful analysis by pulling a nice round number out of a hat.

My pet theory for the root cause is that items in _proxies only get dropped when an API gets bumped. So a node comes up, rosmaster sends it some sort of update, and then the node shuts down, leaving the ServerProxy alive and the socket in CLOSE_WAIT state... and since the master never needs to send any more updates, it never notices that the other end has gone cold.

Thus, having a limited number of proxies helps; taken to the limit, disabling caching entirely helps too. For example, there's another xmlrpcapi() function in rospy/core.py which doesn't do any caching.
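
For illustration, a cache-free variant would look roughly like this (a sketch in the spirit of the rospy/core.py function, not its exact source); every call builds a fresh ServerProxy, so nothing outlives the request that needed it:

try:
    from xmlrpc.client import ServerProxy   # Python 3
except ImportError:
    from xmlrpclib import ServerProxy       # Python 2
try:
    from urllib.parse import urlparse
except ImportError:
    from urlparse import urlparse

def xmlrpcapi_uncached(uri):
    """Return a one-shot XML-RPC proxy for uri, or None if uri is not valid."""
    if uri is None:
        return None
    parsed = urlparse(uri)
    if not parsed[0] or not parsed[1]:
        return None
    # no module-level cache: the proxy (and its socket) can be garbage
    # collected as soon as the caller drops it
    return ServerProxy(uri)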

I'll keep up my attempts to get a lab-reproducible set of nodes; I'm open to any suggestions on how to tease out the bug more easily.

@aginika

aginika commented May 5, 2015

We have a similar problem on Baxter too...

@rethink-imcmahon

@aginika please let me know the symptoms of your issue on Baxter

@lrasinen
Contributor Author

lrasinen commented May 8, 2015

I did some experiments, and it's easy to show that ServerProxy objects pile up. I ran these two loops:

$ while rostopic pub -1 foo std_msgs/String "hi there"; do :; done
$ while : ; do rostopic echo -n 1 foo; done

combined with a debug log entry in util.py that prints out the size of the _proxies dict on every call to xmlrpcapi(), and lo and behold, it does creep up every 3 seconds. However, this doesn't yet reproduce the CLOSE_WAIT situation. I'll try to devise a nastier test later today.

@lrasinen
Contributor Author

lrasinen commented May 8, 2015

Contrary to what I've said earlier, rospy vs roscpp might actually be significant. The Python HTTP server responds with HTTP/1.0, so there's no keep-alive and the XML-RPC client in rosmaster knows this.

I'll write a simple C++ node when I next have time to dedicate to this issue. Might be a week or two, though, so I don't mind if someone beats me to it :)

@dirk-thomas
Member

@lrasinen I can't reproduce a problem with your two-line command line invocations of rostopic. I also don't think that rospy is affected at all, and the command line tools use rospy. Are you using the latest released packages of ROS on Indigo? Can you please post your exact changes (preferably in a ready-to-run fork) which report the increasing proxy counts?

Regarding the increasing number of sockets in CLOSE_WAIT state I think I found an easy way to reproduce it. I did the following:

When doing the same with the rospy_tutorials talker / listener launch file the problem does not exist for me.

@lrasinen
Contributor Author

Gah, sorry for omitting the version details earlier. Anyway, this is on Ubuntu Trusty and the rosmaster package is "1.11.10-0trusty-20141229-2145-+0000". As far as I can tell, that's the latest. If not, I'll have a chat with our IT...

My exact change was this one-liner:

--- util.py.orig    2015-05-08 11:12:22.926255672 +0300
+++ util.py 2015-05-08 14:50:23.494047619 +0300
@@ -58,6 +58,7 @@
         return None
     if not uri in _proxies:
         _proxies[uri] = ServerProxy(uri)
+    print "proxies", len(_proxies), "uri", uri
     return _proxies[uri]

(Also at https://github.com/lrasinen/ros_comm/tree/proxy-debug for merging convenience; but that code just has the patch applied to it and has not been tested as such.)

You're right, the shell examples don't exhibit the CLOSE_WAIT problem, but with the above change you can see the slow growth in the number of proxies. That's what led me to reconsider the difference between roscpp and rospy.

I had a look at the talker / listener launch and that looks very similar to the conditions we're experiencing (where the role of the Ctrl-C is taken by a camera node crashing). I'm willing to say it's a good reproduction of the problem.

@dirk-thomas
Member

I don't think that the increasing number of cached proxies is the primary problem. The number also increases when running the rospy_tutorials talker_listener launch file but that does not leave any sockets in CLOSE_WAIT state.

@dirk-thomas
Member

While looking into the problem a bit more I found the following weird behavior:

  • when running the C++ talker_listener repeatedly against a roscore from Debian packages it shows the increasing number of sockets in CLOSE_WAIT as mentioned before
  • when running it against a roscore from a devel space (where all ROS core packages are in the workspace) the socket count stays at zero
  • when running it against a roscore from an install space it behaves the same as from Debian packages

In all cases the launch file was started from the same environment (devel space) and the sources in the workspace had the same version as the Debian packages.

@lrasinen
Contributor Author

My theory about the role of server proxies goes like this.

  1. client uses rospy
    The client responds to the HTTP request with HTTP/1.0, causing httplib to close the connection immediately. The socket is closed, and the leaking proxies only leak their minuscule bit of memory (but since the connection is closed, there's not much benefit from caching either).

  2. client uses roscpp
    The client responds with HTTP/1.1, causing httplib to leave the connection open in case it's used again. When the client goes away, the socket stays behind (see the sketch after this list).
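
The difference can be seen with a self-contained sketch (plain Python, not ros_comm code; the server and handler names here are made up for illustration):

try:                                     # Python 2 module names, as used elsewhere in this thread
    from SimpleXMLRPCServer import SimpleXMLRPCServer, SimpleXMLRPCRequestHandler
    from xmlrpclib import ServerProxy
except ImportError:                      # Python 3 equivalents
    from xmlrpc.server import SimpleXMLRPCServer, SimpleXMLRPCRequestHandler
    from xmlrpc.client import ServerProxy
import threading

class KeepAliveHandler(SimpleXMLRPCRequestHandler):
    protocol_version = 'HTTP/1.1'        # change to 'HTTP/1.0' to get close-per-request behaviour

server = SimpleXMLRPCServer(('127.0.0.1', 0), requestHandler=KeepAliveHandler,
                            logRequests=False)
server.register_function(lambda x: x, 'echo')
threading.Thread(target=server.serve_forever).start()

host, port = server.server_address
client = ServerProxy('http://%s:%d/' % (host, port))
client.echo(1)
client.echo(2)   # with HTTP/1.1 this reuses the same TCP connection
# If the server process died at this point, the cached connection on the client
# side would linger in CLOSE_WAIT until the ServerProxy (or its transport) is
# closed or garbage collected.
client('close')()    # explicitly drop the cached connection
server.shutdown()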

Anyway, to test this I wrote even more debugging at https://github.com/lrasinen/ros_comm/tree/moalv (Mother of all layering violations)

My afternoon schedule looks busy so I can't test the roscpp part of the above theory today, I'll try to get back to this later this week though.

@lrasinen
Contributor Author

I was able to get a few CLOSE_WAIT sockets by running 100 copies of rosrun roscpp_tutorials anonymous_listener in one shell, and rostopic pub -1 chatter std_msgs/String "hi there" in another.

In the ServerProxy debug output we can see file descriptor 7:
uri http://tardis:49228/, fd 7

and its counterpart in lsof -ni listing:
rosmaster 15908 lrasinen 7u IPv4 35406754 0t0 TCP 127.0.0.1:56972->127.0.1.1:49228 (CLOSE_WAIT)

But most of the proxies (currently about 400) don't have any attached sockets so there might be some other factor still at play. Writing to a closed socket will eventually fail and close the connection, so perhaps there's some list that gets pruned before that occurs.
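
As an aside, a few lines of Python make it easier to watch the CLOSE_WAIT count for a process than grepping lsof output (this assumes the psutil library is available; it is purely a diagnostic, not ros_comm code):

import psutil

def close_wait_count(pid):
    """Count CLOSE_WAIT sockets held by the given process."""
    conns = psutil.Process(pid).connections(kind='inet')
    return sum(1 for c in conns if c.status == psutil.CONN_CLOSE_WAIT)

# e.g. for the rosmaster pid from the lsof line above
print(close_wait_count(15908))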

@dirk-thomas
Member

I disabled the keep-alive functionality for the xml rpcs (…) but the problem is still the same.

@mgrrx
Contributor

mgrrx commented May 30, 2016

We observed the same problem on our robots. I finally debugged it down to a single line of code that has to be added in rosgraph (mgrrx@4f06033):

diff --git a/tools/rosgraph/src/rosgraph/xmlrpc.py b/tools/rosgraph/src/rosgraph/xmlrpc.py
index 7d9aad8..a9c3d52 100644
--- a/tools/rosgraph/src/rosgraph/xmlrpc.py
+++ b/tools/rosgraph/src/rosgraph/xmlrpc.py
@@ -76,6 +76,8 @@ def isstring(s):
         return isinstance(s, str)

 class SilenceableXMLRPCRequestHandler(SimpleXMLRPCRequestHandler):
+    protocol_version = 'HTTP/1.1'
+
     def log_message(self, format, *args):
         if 0:
             SimpleXMLRPCRequestHandler.log_message(self, format, *args)

Caching the ServerProxies in the rosmaster only makes sense if HTTP/1.1 is used. With HTTP/1.0, a connection to the rosmaster remains in CLOSE_WAIT until the entry in the _proxies dictionary gets deleted or the socket is closed explicitly, e.g. by calling _proxies[uri]._ServerProxy__close() after a timeout or something similar.
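
As an illustration of that idea (a sketch only, not the actual ros_comm patch; _lock and _proxies refer to the module-level cache in rosmaster's util.py as shown earlier in this thread), remove_server_proxy could close the transport when it evicts an entry:

def remove_server_proxy(uri):
    with _lock:
        proxy = _proxies.pop(uri, None)
    if proxy is not None:
        try:
            # name-mangled private method on xmlrpclib.ServerProxy; it closes
            # the cached connection so the file descriptor is actually released
            proxy._ServerProxy__close()
        except Exception:
            pass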

Update: Unfortunately this fix didn't solve all the problems we observed. I have now replaced the httplib backend of xmlrpclib with python-requests. In combination with the change above, I could reduce the number of CLOSE_WAIT connections to 1 on our system (~80 nodes). Previously, we had more than 100 CLOSE_WAIT connections when starting the robot.

Can you please verify whether that fixes the issue? I'd like to open a pull request as soon as possible.

mgrrx added commits to mgrrx/ros_comm that referenced this issue May 31, 2016
@dirk-thomas
Member

Can you please clarify what exact change you are proposing?

@mgrrx
Contributor

mgrrx commented May 31, 2016

My first attempt was to make sure HTTP/1.1 is used for every connection. mgrrx/ros_comm@4f06033 fixed this and eliminated all CLOSE_WAITs in my dummy world (playing around with talker/listener nodes from the tutorial packages). However, after testing on the real robot, I found out that it is still not working and connections remain in the CLOSE_WAIT state.

After a bit of googling and playing around with xmlrpclib and the underlying httplib, I figured out that:

  • httplib's HTTPConnection is not thread-safe
  • the HTTP/1.1 implementation is not that clear and people started writing httplib2 and python-requests to replace this library.

If I understand the httplib code correctly, connections are only closed if the header "connection: close" is sent. Even though a client might close the connection, the socket then remains in the CLOSE_WAIT state.

I found some other projects that are using the requests library as the HTTP backend for xmlrpclib. I implemented that in rosgraph and updated the other packages that use xmlrpclib (mgrrx@f1f3340). The subsequent commits update the tests and docstrings.

So my proposed solution is to use the requests library for http connections and to make sure HTTP/1.1 is used everywhere.
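
In rough terms, the requests-based backend amounts to a custom Transport for xmlrpclib along these lines (a simplified sketch, not the actual commit; HTTPS support and error handling are omitted):

try:
    import xmlrpclib                     # Python 2
except ImportError:
    import xmlrpc.client as xmlrpclib    # Python 3
import requests

class RequestsTransport(xmlrpclib.Transport):
    """Send XML-RPC requests through a requests.Session (keep-alive, pooling)."""

    def __init__(self):
        xmlrpclib.Transport.__init__(self)
        self._session = requests.Session()

    def request(self, host, handler, request_body, verbose=0):
        response = self._session.post(
            'http://%s%s' % (host, handler),
            data=request_body,
            headers={'Content-Type': 'text/xml'})
        response.raise_for_status()
        # reuse the stock xmlrpclib parser to turn the XML body into Python values
        parser, unmarshaller = self.getparser()
        parser.feed(response.content)
        parser.close()
        return unmarshaller.close()

# usage: proxy = xmlrpclib.ServerProxy('http://localhost:11311/', transport=RequestsTransport())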

@dirk-thomas
Member

Please see #371 for a related patch. If you could provide PRs for these changes which pass all the cases that were blocking the previous PR, that would be great. But such a change will require extensive testing and might only be considered for future ROS releases (maybe still for Kinetic, but I am not sure about that).

@mgrrx
Contributor

mgrrx commented Jun 3, 2016

My implementation passes the test cases mentioned in #371 without blocking any terminal. I totally agree with you that this needs extensive testing, but I'll keep testing the modified core on our robots anyway before opening the PR.
It would be helpful if anyone could test the patches on Kinetic, as we are still using Indigo…

mgrrx added commits to mgrrx/ros_comm that referenced this issue on Jul 11, Jul 13, and Jul 14, 2016
@dirk-thomas
Member

The referenced PRs are currently being actively tested. As soon as we have made sure that the changes don't introduce regressions, they will be merged and released into Kinetic. Afterwards they will be considered for backporting to Jade / Indigo.

You could try the patches from the referenced PRs on your robot and report back in them about your experience. It always helps to know whether a patch does or doesn't work for more users.

@alecive

alecive commented Jan 27, 2017

Thank you @dirk-thomas for the answer. So, to make it clear, what you suggest is to compile ros_comm from https://github.com/mgrrx/ros_comm/tree/fix/610 and see if it works for us, am I right?

On our Baxter machine we have ROS Indigo installed system-wide from the Debian packages (yes, I don't like this either, but I inherited the robot from other people), so we would need to do the following:

  • Uninstall all the system packages
  • Download and compile ROS from source using this tutorial: http://wiki.ros.org/indigo/Installation/Source
  • Look for the ros_comm package and replace it with this version: https://github.com/mgrrx/ros_comm/tree/fix/610
  • Recompile ROS to link against the new version of ros_comm
  • Recompile the Baxter code to link against the new version of ros_comm
  • Recompile our code to link against the new version of ros_comm
  • Cross fingers, and test

Doing that is not a problem, I just wanted to be sure that we take all the right steps in order to avoid further problems down the road.

@alecive

alecive commented Jan 27, 2017

Plus, @rethink-imcmahon correct me if I'm wrong, but we do not have any way to manually recompile the patched version of ROS on the Baxter machine. Am I right?

@dirk-thomas
Member

Since the PR only touches Python code you don't need to compile anything. Simply clone the branch into a workspace (or create an empty one), build the workspace, and source the result. That will put all the Python code on the PYTHONPATH in front of the system packages.

jspricke pushed commits to mgrrx/ros_comm that referenced this issue Jan 31, 2017
dirk-thomas pushed a commit that referenced this issue Feb 8, 2017
* rospy: switched from poll to epoll for tcpros. Fixes #107 for me and is related to #610 as connections remained in CLOSE_WAIT.
Added a check for CLOSE_WAIT connections in the RegManager loop

* using hasattr instead of dir

* Tests for #107, currently failing of course

* Simplified the code for selecting the Poller backend.
Added a fallback to select.poll if select.epoll is not available in
Linux. OS X will still use kqueue and Windows the noop operator.

* Removed copyrights from tests
Removed whitespace

* Rewrote the check for closed connections. The crucial part is now guarded but the lock is only acquired when connection list needs to be modified. This avoids unnecessary locking.

* Remove copyright
mikepurvis added a commit to clearpathrobotics/ros_comm that referenced this issue Feb 10, 2017
efernandez pushed a commit to clearpathrobotics/ros_comm that referenced this issue Feb 23, 2017
otamachan added a commit to otamachan/ros_comm that referenced this issue Jun 13, 2017
* Check CLOSE_WAIT sockets and close them whenever call xmlrpcapi
otamachan added a commit to otamachan/ros_comm that referenced this issue Jun 13, 2017
* Check and close CLOSE_WAIT socket whenever xmlrpcapi is called
@dirk-thomas
Member

With #1104 being merged, can you please double-check that all cases from this ticket are covered and that this can be closed?

@dirk-thomas
Member

Assuming that the referenced PR addressed the problem I will close this. Please feel free to comment and this can be reopened.
