-
Notifications
You must be signed in to change notification settings - Fork 225
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Fix memory leaks #197
Fix memory leaks #197
Conversation
@martins-mozeiko Please provide more information about how you debugged / tested this - at best a reproducible example. |
One memory leak happens in Simple test to demonstrate this is to run demo listener: Here's an example from my machine (Ubuntu 16.04). First I modified talker node to produce more message to get memory usage grow faster - set timer_period to 0.001 and disable logging to stdout so it does not slow down on printing messages to terminal. Now check the memory usage immediately after I run the listener node:
After running for ~10min:
And it keeps growing... Other memory leak happens in Reproducing memory leak for timer is easy with Same a before I run modified talker with timer_period set to 0.001 to produce much more messages so it is easier to see how memory grows. Here's the memory usage immediately after I launch it:
Here's the memory usage after ~10 min:
And it keeps growing... After this patch is applied to rclpy, the memory usage for these two demo nodes does not change at all regardless of how long you run. |
I just tried to reproduce your example. I modified the talker to fire the timer every ms and removd the log call from the talker (actually print a message every one thousand msgs to see progress). When I run both nodes they use a constant memory amount of 21.2 MB (talker) and 21.4 MB (listener) - the same after running for several minutes. Can you please clarify which state of the code you are basing your testing on. Is it the latest release (Ardent) or the current master? Please also mention the Python version you are using - since I am testing on Bionic I am using 3.6.5. |
I'm pretty sure this is Python version independent. For this demo talker/listener example I was using Ubuntu 16.04 with Python 3.5.2 and ros2 build from current master. I was running it on x86_64 desktop. Just now I tried with Ubuntu 18.04 that has Python 3.6.5 - same result, both talker and listener are leaking. But we have also reproduced this on Yocto based build with Python 3.5.2, using Ardent (0.4.0 tag) and running on armv7 Raspberry Pi 3 with custom ros2 nodes. Just to be clear, my modified talker node looks like this:
Here's a screen recording of session that gives me memory leak on Ubuntu 18.04 with current master of ros2. User is root because I'm running this inside docker: https://asciinema.org/a/a5Gg2GvjhBjkYPVOXiygWtOAY |
rclpy/src/rclpy/_rclpy.c
Outdated
PyObject* obj = PyLong_FromUnsignedLongLong((uint64_t) & struct_ptr[idx]->impl); \ | ||
if (obj) { \ | ||
PyList_Append(entity_ready_list, obj); \ | ||
Py_DECREF(obj); \ |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
👍 for this part of the patch.
Py_DECREF(obj); \ | ||
} else { \ | ||
Py_DECREF(entity_ready_list); \ | ||
return NULL; \ |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This else
changes behavior. Before the code didn't check the value returned by PyLong_FromUnsignedLongLong
(sure, not great). With the patch it suddenly return NULL
in that case without any kind of error reporting. I neither think that change is necessary to address the memory leak described in this ticket nor that it should be done like this.
Therefore I would suggest to just remove the else
block from the patch.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I'm pretty sure PyLong_FromUnsignedLongLong is setting exception when it returns NULL. That's how most CPython API functions work - if they fail they set global exception which is "raised" when C function returns and return value will not be used at all.
I feel that by ignoring error case and just skip adding element to list won't inform user of this API that something "bad" happened. Are you sure its ok to so?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Here's one place where PyLogn_FromUnsignedLongLong can fail and return NULL: https://github.com/python/cpython/blob/6405feecda6a5d5dd7a4240eb3054a2676ed29b1/Objects/longobject.c#L327
It calls _PyLong_New which always sets exception in cases it returns NULL: https://github.com/python/cpython/blob/6405feecda6a5d5dd7a4240eb3054a2676ed29b1/Objects/longobject.c#L203
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Can you provide a reproducible example for that exact case (so that PyLong_FromUnsignedLongLong
does return NULL
)?
We are just in the very last phase before the release and don't have much bandwidth. Since this does fix the memory leaks it should get "in". There is just not much time to do additional testing anymore. It should indeed be investigated and addressed better as mentioned in #201.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Looking closer into how it is being called (
rclpy/rclpy/src/rclpy/_rclpy.c
Lines 2313 to 2331 in 869aa95
PyObject * entity_ready_list = PyList_New(0); | |
if (0 == strcmp(entity_type, "subscription")) { | |
GET_LIST_READY_ENTITIES(subscription) | |
} else if (0 == strcmp(entity_type, "client")) { | |
GET_LIST_READY_ENTITIES(client) | |
} else if (0 == strcmp(entity_type, "service")) { | |
GET_LIST_READY_ENTITIES(service) | |
} else if (0 == strcmp(entity_type, "timer")) { | |
GET_LIST_READY_ENTITIES(timer) | |
} else if (0 == strcmp(entity_type, "guard_condition")) { | |
GET_LIST_READY_ENTITIES(guard_condition) | |
} else { | |
PyErr_Format(PyExc_RuntimeError, | |
"'%s' is not a known entity", entity_type); | |
return NULL; | |
} | |
return entity_ready_list; | |
} |
NULL
which matches the logic of the caller. The same happens already when the entity_type
is of invalid type.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
(sorry, used wrong account above)
This is harder to reproduce. From looking in Python code I can see this can happen only when Python runs out of memory (malloc fails). It will call PyErr_NoMemory() and return NULL. This means Python will raise MemoryError exception when control gets back to python code.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
👍
@@ -2413,6 +2418,7 @@ rclpy_take(PyObject * Py_UNUSED(self), PyObject * args) | |||
"unable to retrieve destroy_ros_message function, type_support mustn't have been imported"); | |||
|
|||
void * taken_msg = convert_from_py(pymsg); | |||
Py_DECREF(pymsg); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
👍 for this part of the patch.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I can confirm the current patch addresses two major memory leaks.
The follow up work has been ticketed in #201.
@dirk-thomas FYI I found one more memory leak in Python code generated by rosidl. It's a pretty big one actually. I'll be submitting new pull request for rosidl shortly. |
ci looks happy 👍 |
This fixes two memory leaks in rclpy.
Python Object reference count was not decreased where it is needed.