Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

rosmaster leaves sockets in CLOSE_WAIT state #610 #834

Closed
wants to merge 18 commits into from

Conversation

mgrrx
Copy link
Contributor

@mgrrx mgrrx commented Jul 14, 2016

As already discussed in #610, this PR requires extensive testing. This PR replaces the HTTP backend of xmlrpclib by python-requests to fix the HTTP keep-alive problem that has been there for more than one year. This fix does NOT fix all CLOSE_WAIT issues that I observed on our robots (see #831 and #833), it only fixes the problem that is caused by the rosmaster <-> python xmlrpclib communication.

One side effect of the python-requests library ist that it throws completely different exceptions if e.g. the rosmaster is not reachable. I added some catch statements that re-throw the "original" exceptions to not break too many packages that catch the original exceptions.

@dirk-thomas
Copy link
Member

Can you please update the imports to something like from rosgraph.xmlrpc import ServerProxy to avoid touching the part where the class is being used. That should make the patch much smaller.

@@ -16,6 +16,7 @@

<run_depend>python-netifaces</run_depend>
<run_depend>python-rospkg</run_depend>
<run_depend>python-requests</run_depend>
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Please move the line one up to maintain alphabetical order.

@dirk-thomas
Copy link
Member

You mentioned that you "added some catch statements that re-throw the "original" exceptions to not break too many packages that catch the original exceptions". Are there any cases where the modified implementation raises different exceptions or behaves differently? Are you expecting to still break some use cases?

# We could add more exception mappings here:
# - requests.exceptions.InvalidURL -> socket.gaierror?
# - requests.exceptions.InvalidSchema -> socket.gaierror?
raise exc
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yes, the requests library throws totally different exceptions. With requests you'll get an exception derived from requests.RequestException (http://docs.python-requests.org/en/master/_modules/requests/exceptions/#RequestException) where the standard implementation would throw socket.error or maybe also httplib.HTTPExcetion. I added statements to throw a socket.timeout exception for timeouts and socket.error if e.g. the remote port is closed. As you can see in the comment, there are more exceptions possible like InvalidURL or InvalidSchema. If you think that adding this compatibility layer makes sense I can further extend this mapping.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Since this change should address some serious problems in the current implementation I think it would be great to add. One critical question will be which ROS distro to target though. This PR is made against Indigo which I would almost rule out since it seems to be a too drastic change with too much chance for regressions for a distro already released more then two years ago.

It good target Kinetic but it would still take a lot of testing (hopefully by multiple parties) to ensure that it doesn't break anything.

The most conservative would be to only aim for the to be released L-turtle but that would mean a long time until it is actually being used (and tested in the field).

Therefore I think Kinetic is a reasonable target. We should try to maintain as much of the behavior (with the same exceptions) as reasonably possible. So if you see other extensions not being mapped yet it would be great to add them.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I would use just raise in this line (without the explicit argument). That carries the intention better imo.

@mikepurvis
Copy link
Member

@mgrrx This needs to be moved to target kinetic-devel. I would do it for you, but I don't have write access to your fork. For now going to remove the hitlist label until it merges cleanly (it's messing up some automation I'm working on).

@jspricke
Copy link
Member

I will take care.

@jspricke jspricke changed the base branch from indigo-devel to kinetic-devel January 31, 2017 19:56
@jspricke
Copy link
Member

rebased to kinetic-devel

@dirk-thomas
Copy link
Member

@ros-pull-request-builder retest this please

@mgrrx
Copy link
Contributor Author

mgrrx commented Feb 8, 2017

@ros-pull-request-builder retest this please

@dirk-thomas
Copy link
Member

dirk-thomas commented Feb 9, 2017

I just merged #981, which fixed the failing tests on the kinetic-devel branch. Therefore this should hopefully pass now...

@ros-pull-request-builder retest this please

@dirk-thomas
Copy link
Member

The tests fail because the master api now raises a requests.packages.urllib3.exceptions.NewConnectionError if the master is not available. As you already suspected in the original description this needs to be updated to handle the new exception types.

@mgrrx
Copy link
Contributor Author

mgrrx commented Feb 9, 2017

I'll have a look tomorrow, thank you!

mikepurvis added a commit to clearpathrobotics/ros_comm that referenced this pull request Feb 10, 2017
efernandez pushed a commit to clearpathrobotics/ros_comm that referenced this pull request Feb 23, 2017
@mikepurvis
Copy link
Member

Okay, so we've now been running this for a few weeks, and we're having issues in our logs with a lot of messages that look like this coming from some rospy processes:

[urllib3.connectionpool][INFO] 2017-02-27 18:15:03,214: Starting new HTTP connection (1): localhost
[urllib3.connectionpool][INFO] 2017-02-27 18:15:03,218: Starting new HTTP connection (1): localhost
[urllib3.connectionpool][INFO] 2017-02-27 18:15:03,223: Starting new HTTP connection (1): localhost
[urllib3.connectionpool][INFO] 2017-02-27 18:15:03,228: Starting new HTTP connection (1): localhost

These seem to be originating from the version of urllib3 that's bundled with Requests, specifically this line:

https://github.com/shazow/urllib3/blob/1.8/urllib3/connectionpool.py#L171

Now, the wrinkle in all this is that we're running all our robots still on Ubuntu Trusty, and we're using the default system version of Requests, which is 2013 vintage. So it's possible this change would be perfectly fine with the Requests that comes with Ubuntu Xenial, but we don't have a great way of testing that at the moment, short of backporting it (which we may well do anyway).

@mikepurvis mikepurvis removed the hitlist label Mar 1, 2017
@mikepurvis
Copy link
Member

Dropping testing from this for now as we're entering a release sprint— in a few weeks, I'll add it back in in conjunction with testing on an updated python-requests.

@furushchev
Copy link
Contributor

@mikepurvis Any update? I'm very happy if this is merged.

@furushchev
Copy link
Contributor

As I mentioned at #1024, this pull request breaks behavior when rosparam includes non-ascii characters.
Since I could not send pull request to this branch, I had to create another pull request. Sorry for bothering.
Please see explanation at #1024 for detail, and review! :)

@k-okada
Copy link
Contributor

k-okada commented Apr 5, 2017

Hi, we have been testing this on Ubuntu Trusty machine and works great for a few days, so it's ok to merge this.
Without this, this we have encountered CLOSE_WAIT trouble every 2 hours (because of our system requires re-start roslaunch often)

@dirk-thomas
Copy link
Member

@mgrrx Can you please take a look at the additional patches @furushchev added in #1024 and consider to merge them into your branch in order to add them to this PR.

@k-okada Previous testing has shown some regressions which seemed to be related to this patch. So I don't consider this to be ready to be merge at the moment even if it addresses the CLOSE_WAIT problem for you.

@mikepurvis
Copy link
Member

We're now equipped to install an updated requests library onto our bots (via catkin_tools_python), so I'll try to set that up in the next day or two. Sorry for letting this drop.

@furushchev
Copy link
Contributor

Hi, after I tried this patch on our robot and confirmed that with this patch there is no more CLOSE_WAIT after several hundreds spawning / killing nodes.
However I also found 3 points that we need to care about.

  • If roscore applied this patch is spawned, some kind of nodes crashes without any errors.
  • Also I felt the performance is not so high as the original one at some timing (e.g. rosnode list slightly takes more time than rosmaster without the patch).
  • rosnode info /some_node returns all publishers and subscribers, but does not returns their connections

@furushchev
Copy link
Contributor

furushchev commented Apr 15, 2017

Hi there,

I further investigated and I found another approach to maintain one rosmaster without crash.
I sent a TCP packet setting ACK flag to all source ports of CLOSE_WAIT connections without payload, and then rosmaster closes the port.
So, if you don't want to change the code base of rosmaster, I'd like to propose to add a function to monitor the number of CLOSE_WAIT connections and ask rosmaster to close them.

And here is python code for this. It looks very dirty and I totally don't have an idea how to integrate this to ros repository though:
https://gist.github.com/8a4b0fa5f7ffe9ff5ea0c69d6c285cc4

@dirk-thomas
Copy link
Member

This should be addressed by #1104.

@dirk-thomas
Copy link
Member

I will close this for now assuming this patch isn't needed anymore. Please feel to comment and the ticket can be reopened.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

Successfully merging this pull request may close these issues.

None yet

6 participants