
Fix HTTP for kernel < 4.16 #2132

Merged (9 commits), Mar 11, 2021

Conversation

@jikawa-az (Contributor) commented Feb 14, 2021

Fixes bug #2118

Linux kernels 4.15 and older encounter performance issues with HTTP/1.1. I have tested this with a customer's 4.14-kernel workflow, and it addressed their performance issues.
This will need to be backported to Melodic.

Signed-off-by: Jesse Ikawa jikawa@amazon.com
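For context, here is a minimal sketch of the kind of kernel-version gate this change adds. The helper name matches the _support_http_1_1() referenced later in this thread, but the version parsing and non-Linux fallback below are illustrative assumptions, not the merged tools/rosgraph/src/rosgraph/xmlrpc.py code:

```python
# Sketch only: enable HTTP/1.1 only on kernels without the regression from
# #2118. Parsing and fallback behaviour are assumptions, not the merged code.
import platform

def _support_http_1_1():
    """Return True if the running kernel is new enough (>= 4.16) to use
    HTTP/1.1 without the performance issue described in #2118."""
    if platform.system() != 'Linux':
        return True
    release = platform.release().split('-')[0]            # e.g. '4.14.203'
    major, minor = (int(part) for part in release.split('.')[:2])
    return (major, minor) >= (4, 16)

if __name__ == '__main__':
    print('HTTP/1.1 enabled:', _support_http_1_1())
```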

@fujitatomoya (Contributor)

@ros-pull-request-builder please run the test again

(three resolved review threads on tools/rosgraph/src/rosgraph/xmlrpc.py)
@jikawa-az (Contributor, author)

Thanks for the review! I will update the PR.

@fujitatomoya (Contributor) left a comment

one nitpick, lgtm

(one resolved review thread on tools/rosgraph/src/rosgraph/xmlrpc.py)
@jikawa-az (Contributor, author)

Looks like the checks passed; if it looks good can we merge? Also what's the process for backport, should I just open another PR for melodic-devel?

@fujitatomoya (Contributor)

if it looks good can we merge?

I am good to go with this, but I am not a maintainer... @sloretz @jacobperron, friendly ping.

Also what's the process for backport, should I just open another PR for melodic-devel?

I think the backport will be taken care of by CI.

@jikawa-az (Contributor, author)

Awesome, thanks for all the help!!

@jinmenglei (Contributor)

@jikawa-az
if _support_http_1_1():
like this?

@jikawa-az (Contributor, author)

Hi @sloretz @jacobperron, when you get a chance, may we please have a review? I would like to have this fix merged to address a customer's performance issues.
Thank you!

(one resolved review thread on tools/rosgraph/src/rosgraph/xmlrpc.py)
@jikawa-az (Contributor, author)

Friendly ping: @sloretz @jacobperron

@jacobperron (Contributor) left a comment

Apologies for the delay. The change LGTM.

jacobperron pushed a commit that referenced this pull request on Apr 6, 2021:
Addressing performance issues described in #2118

Signed-off-by: Jesse Ikawa <jikawa@amazon.com>

Co-authored-by: Emerson Knapp <537409+emersonknapp@users.noreply.github.com>
@schadocalex (Contributor) commented Jun 22, 2021

Update after @Crcodlus's comment in #2118 (comment).

the memory usage of rosmaster unfortunately increases continuously and the memory is not going to be released anymore.

I have the exact same problem. Any use of the rosmaster API, even a simple getPid, increases memory usage without releasing it.

After days of investigation, I found out it was caused by ros_comm itself and not by something in my project. I did a git bisect and "8b4089917fef19ab9fd0e00f11ae06235ed7380b is the first bad commit". I work at a company with robots, and all robots updated to >= 1.15.10 with a Linux kernel < 4.16 have the bug. In our project it increases memory usage by 3 GB/hour and sometimes freezes the communication between nodes.

I will have to force 1.15.9 or update the Linux kernel on all affected robots; I think that is rather critical!
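A hedged sketch of the kind of probe that makes this growth visible (this is not the #2118 test app): repeatedly call the master's getPid() and watch rosmaster's resident set size. It assumes a running master reachable via ROS_MASTER_URI, that the script runs on the machine hosting rosmaster, and that psutil is installed:

```python
# Hedged sketch, not the #2118 test app: hammer the master's getPid() and
# report rosmaster's RSS, which keeps growing on affected setups.
import os
import time
from xmlrpc.client import ServerProxy

import psutil  # assumed available; only used to read rosmaster's RSS

master = ServerProxy(os.environ.get('ROS_MASTER_URI', 'http://localhost:11311'))
_, _, master_pid = master.getPid('/leak_probe')  # Master API returns (code, msg, pid)
rosmaster_proc = psutil.Process(master_pid)

for batch in range(1, 11):
    for _ in range(1000):
        master.getPid('/leak_probe')
    rss_mib = rosmaster_proc.memory_info().rss / (1024 * 1024)
    print('after %5d calls: rosmaster RSS = %.1f MiB' % (batch * 1000, rss_mib))
    time.sleep(1)
```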

@emersonknapp (Contributor)

That's not good!
It's very strange. Have you tried building from source? Do you know whether it's the HTTP protocol version or the daemon_threads option that is introducing the memory issue, or only the combination of the two?

This change to HTTP/1.1 that went in from Kinetic to Lunar made it so that our service calls would reliably fail. Is this application old enough that you happened to be running it on Kinetic? If so, did you experience any performance issues when you upgraded to Lunar/Melodic/Noetic?

This issue is especially hard to test, because even containerizing doesn't help since Docker shares the host kernel.

@schadocalex (Contributor) commented Jun 23, 2021

Yes, I was building from source to test the 1.15.9 version and then to do the git bisect.

I've just tested, and it is the daemon_threads option. When I force it to True, the memory issue disappears. We could force it all the time; I'm not really sure why it was conditioned on HTTP/1.1 support.

I can't tell you about any performance issues; I didn't notice anything in the past (we were on Kinetic, then Melodic, and now Noetic) before this memory issue (which also triggers communication timeouts between some nodes).

@schadocalex (Contributor)

Even if the code has changed, I think it is related to that: https://salsa.debian.org/debian/python-prometheus-client/commit/5aa256d8aab3b81604b855dc03f260342fc391fb

@emersonknapp (Contributor)

it is the daemon_threads option

That's good to know! I see no reason why daemon_threads should depend on the HTTP version. The only reason the changes were grouped together was to "switch off" the change in https://github.com/ros/ros_comm/pull/1287/files, which we observed introduced the problem. Perhaps you could open a followup PR making daemon_threads non-conditional?
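To illustrate the suggestion, a minimal sketch of an unconditional daemon_threads setting on a threaded XML-RPC server; the class name and the registered example function are illustrative, not the actual rosgraph implementation:

```python
# Sketch of the suggested followup: set daemon_threads unconditionally
# instead of tying it to the HTTP version. Names are illustrative.
from socketserver import ThreadingMixIn
from xmlrpc.server import SimpleXMLRPCServer

class ThreadingXMLRPCServer(ThreadingMixIn, SimpleXMLRPCServer):
    # With daemon_threads left False, Python 3.7+ keeps every handler thread
    # in an internal list until server_close(); on the HTTP/1.0 path (one
    # connection, hence one thread, per request) that list grows without
    # bound, which matches the memory growth reported above.
    daemon_threads = True

if __name__ == '__main__':
    server = ThreadingXMLRPCServer(('127.0.0.1', 0), logRequests=False)
    server.register_function(lambda: 12345, 'getPid')  # stand-in API method
    print('serving on port %d' % server.server_address[1])
    server.serve_forever()
```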

@schadocalex (Contributor)

Sure: #2165

@peci1 (Contributor) commented Sep 1, 2021

Please see #2182; this fix has unwanted side effects on 4.x systems: it does not allow setting parameters larger than 32 kB from roscpp.

@peci1 (Contributor) commented Nov 9, 2021

@jikawa-az would you be able to write down a way to tell whether the performance hit you observed is resolved or not? There are several other issues around the change this PR made, and it would be good to know that solutions to those issues do not break the improvement your PR brought.

@peci1 (Contributor) commented Nov 9, 2021

@jikawa-az or is the test app from #2118 still a good test for this?

@peci1 (Contributor) commented Nov 10, 2021

Okay, I confirm that the test app from #2118 still fails on a Jetson with kernel 4.9 if _support_http_1_1() is forced to return True. When it returns False (as is done in this PR), the test app can run for a long time without failure. However, there is still issue #2182, which is caused by this PR.

@emersonknapp (Contributor)

Thanks for confirming that the test app from #2118 still fails. That's the only repro we currently have for the issue.
