Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Fix memory leaks #197

Merged
merged 6 commits into from
Jun 29, 2018
Merged

Fix memory leaks #197

merged 6 commits into from
Jun 29, 2018

Conversation

martins-mozeiko
Copy link
Contributor

This fixes two memory leaks in rclpy.
Python Object reference count was not decreased where it is needed.

@dirk-thomas dirk-thomas added the in review Waiting for review (Kanban column) label Jun 28, 2018
@mikaelarguedas
Copy link
Member

  • Linux Build Status
  • Linux-aarch64 Build Status
  • macOS Build Status
  • Windows Build Status

@dirk-thomas
Copy link
Member

@martins-mozeiko Please provide more information about how you debugged / tested this - at best a reproducible example.

@martins-mozeiko
Copy link
Contributor Author

One memory leak happens in rclpy_take function. It does not release pymsg python object. This function is called for all incoming messages. This means - anything that subscribes will leak memory.

Simple test to demonstrate this is to run demo listener: ros2 run demo_nodes_py listener.
Launch talker in separate terminal and start watching memory usage of listener process. It will grow unbounded.

Here's an example from my machine (Ubuntu 16.04). First I modified talker node to produce more message to get memory usage grow faster - set timer_period to 0.001 and disable logging to stdout so it does not slow down on printing messages to terminal. Now check the memory usage immediately after I run the listener node:

$ grep VmRSS /proc/`pgrep -f demo_nodes_py/listener`/status
VmRSS:     32700 kB

After running for ~10min:

$ grep VmRSS /proc/`pgrep -f demo_nodes_py/listener`/status
VmRSS:     80112 kB

And it keeps growing...

Other memory leak happens in rclpy_get_ready_entities function which is called multiple times from here. If any of these calls will return non-empty list, it will leak memory for each element that is added to list. In my case I was using timer in a ros node, so I saw memory being allocated because of call in line 416. But this will happen for other calls as well (not exactly sure how to reproduce those).

Reproducing memory leak for timer is easy with ros2 run demo_nodes_py talker. It will use timer to publish messages.

Same a before I run modified talker with timer_period set to 0.001 to produce much more messages so it is easier to see how memory grows. Here's the memory usage immediately after I launch it:

$ grep VmRSS /proc/`pgrep -f demo_nodes_py/talker`/status
VmRSS:     32360 kB

Here's the memory usage after ~10 min:

$ grep VmRSS /proc/`pgrep -f demo_nodes_py/talker`/status
VmRSS:     59392 kB

And it keeps growing...

After this patch is applied to rclpy, the memory usage for these two demo nodes does not change at all regardless of how long you run.

@dirk-thomas
Copy link
Member

I just tried to reproduce your example. I modified the talker to fire the timer every ms and removd the log call from the talker (actually print a message every one thousand msgs to see progress). When I run both nodes they use a constant memory amount of 21.2 MB (talker) and 21.4 MB (listener) - the same after running for several minutes.

Can you please clarify which state of the code you are basing your testing on. Is it the latest release (Ardent) or the current master? Please also mention the Python version you are using - since I am testing on Bionic I am using 3.6.5.

@dirk-thomas dirk-thomas added the more-information-needed Further information is required label Jun 28, 2018
@martins-mozeiko
Copy link
Contributor Author

I'm pretty sure this is Python version independent.

For this demo talker/listener example I was using Ubuntu 16.04 with Python 3.5.2 and ros2 build from current master. I was running it on x86_64 desktop.

Just now I tried with Ubuntu 18.04 that has Python 3.6.5 - same result, both talker and listener are leaking.

But we have also reproduced this on Yocto based build with Python 3.5.2, using Ardent (0.4.0 tag) and running on armv7 Raspberry Pi 3 with custom ros2 nodes.

Just to be clear, my modified talker node looks like this:

class Talker(Node):

    def __init__(self):
        super().__init__('talker')
        self.i = 0
        self.pub = self.create_publisher(String, 'chatter')
        timer_period = 0.001   ### MODIFIED HERE
        self.tmr = self.create_timer(timer_period, self.timer_callback)

    def timer_callback(self):
        msg = String()
        msg.data = 'Hello World: {0}'.format(self.i)
        self.i += 1
        #self.get_logger().info('Publishing: "{0}"'.format(msg.data)) ### COMMENTED OUT
        self.pub.publish(msg)

Here's a screen recording of session that gives me memory leak on Ubuntu 18.04 with current master of ros2. User is root because I'm running this inside docker: https://asciinema.org/a/a5Gg2GvjhBjkYPVOXiygWtOAY

PyObject* obj = PyLong_FromUnsignedLongLong((uint64_t) & struct_ptr[idx]->impl); \
if (obj) { \
PyList_Append(entity_ready_list, obj); \
Py_DECREF(obj); \
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

👍 for this part of the patch.

Py_DECREF(obj); \
} else { \
Py_DECREF(entity_ready_list); \
return NULL; \
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This else changes behavior. Before the code didn't check the value returned by PyLong_FromUnsignedLongLong (sure, not great). With the patch it suddenly return NULL in that case without any kind of error reporting. I neither think that change is necessary to address the memory leak described in this ticket nor that it should be done like this.

Therefore I would suggest to just remove the else block from the patch.

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'm pretty sure PyLong_FromUnsignedLongLong is setting exception when it returns NULL. That's how most CPython API functions work - if they fail they set global exception which is "raised" when C function returns and return value will not be used at all.

I feel that by ignoring error case and just skip adding element to list won't inform user of this API that something "bad" happened. Are you sure its ok to so?

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Here's one place where PyLogn_FromUnsignedLongLong can fail and return NULL: https://github.com/python/cpython/blob/6405feecda6a5d5dd7a4240eb3054a2676ed29b1/Objects/longobject.c#L327
It calls _PyLong_New which always sets exception in cases it returns NULL: https://github.com/python/cpython/blob/6405feecda6a5d5dd7a4240eb3054a2676ed29b1/Objects/longobject.c#L203

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Can you provide a reproducible example for that exact case (so that PyLong_FromUnsignedLongLong does return NULL)?

We are just in the very last phase before the release and don't have much bandwidth. Since this does fix the memory leaks it should get "in". There is just not much time to do additional testing anymore. It should indeed be investigated and addressed better as mentioned in #201.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Looking closer into how it is being called (

rclpy/rclpy/src/rclpy/_rclpy.c

Lines 2313 to 2331 in 869aa95

PyObject * entity_ready_list = PyList_New(0);
if (0 == strcmp(entity_type, "subscription")) {
GET_LIST_READY_ENTITIES(subscription)
} else if (0 == strcmp(entity_type, "client")) {
GET_LIST_READY_ENTITIES(client)
} else if (0 == strcmp(entity_type, "service")) {
GET_LIST_READY_ENTITIES(service)
} else if (0 == strcmp(entity_type, "timer")) {
GET_LIST_READY_ENTITIES(timer)
} else if (0 == strcmp(entity_type, "guard_condition")) {
GET_LIST_READY_ENTITIES(guard_condition)
} else {
PyErr_Format(PyExc_RuntimeError,
"'%s' is not a known entity", entity_type);
return NULL;
}
return entity_ready_list;
}
) I think I am fine with keeping the else block (assuming the call sets an error) and return NULL which matches the logic of the caller. The same happens already when the entity_type is of invalid type.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

(sorry, used wrong account above)

This is harder to reproduce. From looking in Python code I can see this can happen only when Python runs out of memory (malloc fails). It will call PyErr_NoMemory() and return NULL. This means Python will raise MemoryError exception when control gets back to python code.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I kept the else block (2031c48) after looking closer how it is being used and addressed a missing decref in the calling code (3b993a8) as well as eliminated the unreachable return NULL (6ca1439). I hope that is fine for now...

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

👍

@@ -2413,6 +2418,7 @@ rclpy_take(PyObject * Py_UNUSED(self), PyObject * args)
"unable to retrieve destroy_ros_message function, type_support mustn't have been imported");

void * taken_msg = convert_from_py(pymsg);
Py_DECREF(pymsg);
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

👍 for this part of the patch.

@dirk-thomas dirk-thomas removed the more-information-needed Further information is required label Jun 28, 2018
Copy link
Member

@dirk-thomas dirk-thomas left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I can confirm the current patch addresses two major memory leaks.

The follow up work has been ticketed in #201.

@dirk-thomas
Copy link
Member

Local testing looks promising but here are a few CI builds just ot make sure

Just testing rclpy: Build Status

  • Linux Build Status
  • Linux-aarch64 Build Status
  • macOS Build Status
  • Windows Build Status

@martins-mozeiko
Copy link
Contributor Author

@dirk-thomas FYI I found one more memory leak in Python code generated by rosidl. It's a pretty big one actually. I'll be submitting new pull request for rosidl shortly.

@mikaelarguedas
Copy link
Member

ci looks happy 👍

@dirk-thomas dirk-thomas merged commit 1b731ad into ros2:master Jun 29, 2018
@dirk-thomas dirk-thomas removed the in review Waiting for review (Kanban column) label Jun 29, 2018
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

4 participants