Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Crash in callback group pointer vector iterator #813

Closed
guillaumeautran opened this issue Aug 2, 2019 · 3 comments
Closed

Crash in callback group pointer vector iterator #813

guillaumeautran opened this issue Aug 2, 2019 · 3 comments
Labels
bug Something isn't working

Comments

@guillaumeautran
Copy link
Contributor

Bug report

Required Info:

  • Operating System:
    • Ubuntu 16.04
  • Installation type:
    • From Source (Crystal P4) But Master branch has the same issue (from code inspection)
  • Version or commit hash:
    • HEAD of Master for RCLCPP
  • DDS implementation:
    • N/A
  • Client library (if applicable):
    • rclcpp

Steps to reproduce issue

There is 2 separate issues here:

  1. A sever crash can occur in the RCLCPP code when a thread is iterating over a callback_group while another thread is creating a new subscription (or timer, service, etc). The iterator over the actual callback group pointer vector is done unprotected by a lock causing a potential concurrent access leading to a crash.
  2. A memory leak is also present in the callback group pointer vector storing weak reference to subscriptions. If the subscription gets destroyed, the weak reference becomes invalid, however, that weak reference object is never removed from the callback group pointer vector. This overtime leads to memory leaks when subscriptions are added / removed dynamically.

Reproduction step is by using the multithreaded executor as well as separate regular posix threads to create new topic subscriptions. The actual code change is self explanatory.

@dirk-thomas dirk-thomas added the bug Something isn't working label Aug 2, 2019
@dirk-thomas
Copy link
Member

Thank you for the report.

The actual code change is self explanatory.

Please consider to contribute a pull request.

guillaumeautran added a commit to clearpathrobotics/rclcpp that referenced this issue Aug 2, 2019
1. Mutually exclusive callback group hangs
  The root cause for this issue is due to a combination between the multithreaded executor and the mutually exclusive callback group used for all the ROS topics.
When the executor collects all the references to the subscriptions, timers and services, it skips the mutually exclusive callback_groups which are currently locked (ie: being processed by another thread).
This cause the resulting waitset to only contain the guard pointers.
If there is no activity on those guards, the thread will wait for work forever in the get_next_executable and block all other threads.
  The resolution is to simply add a timeout for the multithreaded get_next_executable call ensuring the calling thread will eventually return.

2. Memory leak in callback group weak reference vectors
  There is leak in the callback group weak reference vectors that occurs when a weak reference becomes invalid (subscription is dropped, service deleted, etc).
The now obsolete weak pointer reference is never deleted from the callback group pointer vector causing the leak.
  The resolution of this problem is implemented by scanning and deleting expired weak pointer at the time of insertion of a new weak pointer into the callback group vectors.
This approach is the lowest computational cost to purging obsolete weak pointers.

3. Crash in iterator for callback group pointer vectors
  This problem exists because a reference to the callback group pointer vector is provided as a return value to facilitate loop iterator.
This is a significant crash root cause with a multithreaded executor where a thread is able to add new subscription to the callback group.
The crash is caused by a concurrent modification of the weak pointer vector while another thread is iterating over that same vector resulting in a crash.

Testing:
  These changes where implemented and tested using a test application which creates / publish / deletes thousands of topics (up to 100,000) from a separate standalone thread while the ROS2 layer is receiving data on the topics deleted.
The muiltithreaded was setup to contain 10 separate executor threads on a single mutually exclusive callback group containing thousands of topics.

issue: ros2#813
guillaumeautran added a commit to clearpathrobotics/rclcpp that referenced this issue Aug 2, 2019
1. Mutually exclusive callback group hangs
  The root cause for this issue is due to a combination between the multithreaded executor and the mutually exclusive callback group used for all the ROS topics.
When the executor collects all the references to the subscriptions, timers and services, it skips the mutually exclusive callback_groups which are currently locked (ie: being processed by another thread).
This cause the resulting waitset to only contain the guard pointers.
If there is no activity on those guards, the thread will wait for work forever in the get_next_executable and block all other threads.
  The resolution is to simply add a timeout for the multithreaded get_next_executable call ensuring the calling thread will eventually return.

2. Memory leak in callback group weak reference vectors
  There is leak in the callback group weak reference vectors that occurs when a weak reference becomes invalid (subscription is dropped, service deleted, etc).
The now obsolete weak pointer reference is never deleted from the callback group pointer vector causing the leak.
  The resolution of this problem is implemented by scanning and deleting expired weak pointer at the time of insertion of a new weak pointer into the callback group vectors.
This approach is the lowest computational cost to purging obsolete weak pointers.

3. Crash in iterator for callback group pointer vectors
  This problem exists because a reference to the callback group pointer vector is provided as a return value to facilitate loop iterator.
This is a significant crash root cause with a multithreaded executor where a thread is able to add new subscription to the callback group.
The crash is caused by a concurrent modification of the weak pointer vector while another thread is iterating over that same vector resulting in a crash.

Testing:
  These changes where implemented and tested using a test application which creates / publish / deletes thousands of topics (up to 100,000) from a separate standalone thread while the ROS2 layer is receiving data on the topics deleted.
The muiltithreaded was setup to contain 10 separate executor threads on a single mutually exclusive callback group containing thousands of topics.

issue: ros2#813
Signed-off-by: Guillaume Autran <gautran@clearpath.ai>
guillaumeautran added a commit to clearpathrobotics/rclcpp that referenced this issue Aug 2, 2019
1. Mutually exclusive callback group hangs
  The root cause for this issue is due to a combination between the multithreaded executor and the mutually exclusive callback group used for all the ROS topics.
When the executor collects all the references to the subscriptions, timers and services, it skips the mutually exclusive callback_groups which are currently locked (ie: being processed by another thread).
This cause the resulting waitset to only contain the guard pointers.
If there is no activity on those guards, the thread will wait for work forever in the get_next_executable and block all other threads.
  The resolution is to simply add a timeout for the multithreaded get_next_executable call ensuring the calling thread will eventually return.

2. Memory leak in callback group weak reference vectors
  There is leak in the callback group weak reference vectors that occurs when a weak reference becomes invalid (subscription is dropped, service deleted, etc).
The now obsolete weak pointer reference is never deleted from the callback group pointer vector causing the leak.
  The resolution of this problem is implemented by scanning and deleting expired weak pointer at the time of insertion of a new weak pointer into the callback group vectors.
This approach is the lowest computational cost to purging obsolete weak pointers.

3. Crash in iterator for callback group pointer vectors
  This problem exists because a reference to the callback group pointer vector is provided as a return value to facilitate loop iterator.
This is a significant crash root cause with a multithreaded executor where a thread is able to add new subscription to the callback group.
The crash is caused by a concurrent modification of the weak pointer vector while another thread is iterating over that same vector resulting in a crash.

Testing:
  These changes where implemented and tested using a test application which creates / publish / deletes thousands of topics (up to 100,000) from a separate standalone thread while the ROS2 layer is receiving data on the topics deleted.
The muiltithreaded was setup to contain 10 separate executor threads on a single mutually exclusive callback group containing thousands of topics.

issue: ros2#813
Signed-off-by: Guillaume Autran <gautran@clearpath.ai>
@guillaumeautran
Copy link
Contributor Author

@dirk-thomas I was too slow :) PR attached.

guillaumeautran added a commit to clearpathrobotics/rclcpp that referenced this issue Aug 2, 2019
1. Mutually exclusive callback group hangs
  The root cause for this issue is due to a combination between the multithreaded executor and the mutually exclusive callback group used for all the ROS topics.
When the executor collects all the references to the subscriptions, timers and services, it skips the mutually exclusive callback_groups which are currently locked (ie: being processed by another thread).
This cause the resulting waitset to only contain the guard pointers.
If there is no activity on those guards, the thread will wait for work forever in the get_next_executable and block all other threads.
  The resolution is to simply add a timeout for the multithreaded get_next_executable call ensuring the calling thread will eventually return.

2. Memory leak in callback group weak reference vectors
  There is leak in the callback group weak reference vectors that occurs when a weak reference becomes invalid (subscription is dropped, service deleted, etc).
The now obsolete weak pointer reference is never deleted from the callback group pointer vector causing the leak.
  The resolution of this problem is implemented by scanning and deleting expired weak pointer at the time of insertion of a new weak pointer into the callback group vectors.
This approach is the lowest computational cost to purging obsolete weak pointers.

3. Crash in iterator for callback group pointer vectors
  This problem exists because a reference to the callback group pointer vector is provided as a return value to facilitate loop iterator.
This is a significant crash root cause with a multithreaded executor where a thread is able to add new subscription to the callback group.
The crash is caused by a concurrent modification of the weak pointer vector while another thread is iterating over that same vector resulting in a crash.

Testing:
  These changes where implemented and tested using a test application which creates / publish / deletes thousands of topics (up to 100,000) from a separate standalone thread while the ROS2 layer is receiving data on the topics deleted.
The muiltithreaded was setup to contain 10 separate executor threads on a single mutually exclusive callback group containing thousands of topics.

issue: ros2#813
Signed-off-by: Guillaume Autran <gautran@clearpath.ai>
guillaumeautran added a commit to clearpathrobotics/rclcpp that referenced this issue Aug 7, 2019
1. Mutually exclusive callback group hangs
  The root cause for this issue is due to a combination between the multithreaded executor and the mutually exclusive callback group used for all the ROS topics.
When the executor collects all the references to the subscriptions, timers and services, it skips the mutually exclusive callback_groups which are currently locked (ie: being processed by another thread).
This cause the resulting waitset to only contain the guard pointers.
If there is no activity on those guards, the thread will wait for work forever in the get_next_executable and block all other threads.
  The resolution is to simply add a timeout for the multithreaded get_next_executable call ensuring the calling thread will eventually return.

2. Memory leak in callback group weak reference vectors
  There is leak in the callback group weak reference vectors that occurs when a weak reference becomes invalid (subscription is dropped, service deleted, etc).
The now obsolete weak pointer reference is never deleted from the callback group pointer vector causing the leak.
  The resolution of this problem is implemented by scanning and deleting expired weak pointer at the time of insertion of a new weak pointer into the callback group vectors.
This approach is the lowest computational cost to purging obsolete weak pointers.

3. Crash in iterator for callback group pointer vectors
  This problem exists because a reference to the callback group pointer vector is provided as a return value to facilitate loop iterator.
This is a significant crash root cause with a multithreaded executor where a thread is able to add new subscription to the callback group.
The crash is caused by a concurrent modification of the weak pointer vector while another thread is iterating over that same vector resulting in a crash.

Testing:
  These changes where implemented and tested using a test application which creates / publish / deletes thousands of topics (up to 100,000) from a separate standalone thread while the ROS2 layer is receiving data on the topics deleted.
The muiltithreaded was setup to contain 10 separate executor threads on a single mutually exclusive callback group containing thousands of topics.

issue: ros2#813
Signed-off-by: Guillaume Autran <gautran@clearpath.ai>
guillaumeautran added a commit to clearpathrobotics/rclcpp that referenced this issue Aug 21, 2019
1. Mutually exclusive callback group hangs
  The root cause for this issue is due to a combination between the multithreaded executor and the mutually exclusive callback group used for all the ROS topics.
When the executor collects all the references to the subscriptions, timers and services, it skips the mutually exclusive callback_groups which are currently locked (ie: being processed by another thread).
This cause the resulting waitset to only contain the guard pointers.
If there is no activity on those guards, the thread will wait for work forever in the get_next_executable and block all other threads.
  The resolution is to simply add a timeout for the multithreaded get_next_executable call ensuring the calling thread will eventually return.

2. Memory leak in callback group weak reference vectors
  There is leak in the callback group weak reference vectors that occurs when a weak reference becomes invalid (subscription is dropped, service deleted, etc).
The now obsolete weak pointer reference is never deleted from the callback group pointer vector causing the leak.
  The resolution of this problem is implemented by scanning and deleting expired weak pointer at the time of insertion of a new weak pointer into the callback group vectors.
This approach is the lowest computational cost to purging obsolete weak pointers.

3. Crash in iterator for callback group pointer vectors
  This problem exists because a reference to the callback group pointer vector is provided as a return value to facilitate loop iterator.
This is a significant crash root cause with a multithreaded executor where a thread is able to add new subscription to the callback group.
The crash is caused by a concurrent modification of the weak pointer vector while another thread is iterating over that same vector resulting in a crash.

Testing:
  These changes where implemented and tested using a test application which creates / publish / deletes thousands of topics (up to 100,000) from a separate standalone thread while the ROS2 layer is receiving data on the topics deleted.
The muiltithreaded was setup to contain 10 separate executor threads on a single mutually exclusive callback group containing thousands of topics.

issue: ros2#813
Signed-off-by: Guillaume Autran <gautran@clearpath.ai>
guillaumeautran added a commit to clearpathrobotics/rclcpp that referenced this issue Aug 28, 2019
1. Memory leak in callback group weak reference vectors
  There is leak in the callback group weak reference vectors that occurs when a weak reference becomes invalid (subscription is dropped, service deleted, etc).
The now obsolete weak pointer reference is never deleted from the callback group pointer vector causing the leak.
  The resolution of this problem is implemented by scanning and deleting expired weak pointer at the time of insertion of a new weak pointer into the callback group vectors.
This approach is the lowest computational cost to purging obsolete weak pointers.

2. Crash in iterator for callback group pointer vectors
  This problem exists because a reference to the callback group pointer vector is provided as a return value to facilitate loop iterator.
This is a significant crash root cause with a multithreaded executor where a thread is able to add new subscription to the callback group.
The crash is caused by a concurrent modification of the weak pointer vector while another thread is iterating over that same vector resulting in a crash.

Testing:
  These changes where implemented and tested using a test application which creates / publish / deletes thousands of topics (up to 100,000) from a separate standalone thread while the ROS2 layer is receiving data on the topics deleted.
The muiltithreaded was setup to contain 10 separate executor threads on a single mutually exclusive callback group containing thousands of topics.

issue: ros2#813
Signed-off-by: Guillaume Autran <gautran@clearpath.ai>
guillaumeautran added a commit to clearpathrobotics/rclcpp that referenced this issue Aug 28, 2019
1. Memory leak in callback group weak reference vectors
  There is leak in the callback group weak reference vectors that occurs when a weak reference becomes invalid (subscription is dropped, service deleted, etc).
The now obsolete weak pointer reference is never deleted from the callback group pointer vector causing the leak.
  The resolution of this problem is implemented by scanning and deleting expired weak pointer at the time of insertion of a new weak pointer into the callback group vectors.
This approach is the lowest computational cost to purging obsolete weak pointers.

2. Crash in iterator for callback group pointer vectors
  This problem exists because a reference to the callback group pointer vector is provided as a return value to facilitate loop iterator.
This is a significant crash root cause with a multithreaded executor where a thread is able to add new subscription to the callback group.
The crash is caused by a concurrent modification of the weak pointer vector while another thread is iterating over that same vector resulting in a crash.

Testing:
  These changes where implemented and tested using a test application which creates / publish / deletes thousands of topics (up to 100,000) from a separate standalone thread while the ROS2 layer is receiving data on the topics deleted.
The muiltithreaded was setup to contain 10 separate executor threads on a single mutually exclusive callback group containing thousands of topics.

issue: ros2#813
Signed-off-by: Guillaume Autran <gautran@clearpath.ai>
guillaumeautran added a commit to clearpathrobotics/rclcpp that referenced this issue Aug 28, 2019
1. Memory leak in callback group weak reference vectors
  There is leak in the callback group weak reference vectors that occurs when a weak reference becomes invalid (subscription is dropped, service deleted, etc).
The now obsolete weak pointer reference is never deleted from the callback group pointer vector causing the leak.
The resolution of this problem is implemented by scanning and deleting expired weak pointer at the time of insertion of a new weak pointer into the callback group vectors.
This approach is the lowest computational cost to purging obsolete weak pointers.

2. Crash in iterator for callback group pointer vectors
  This problem exists because a reference to the callback group pointer vector is provided as a return value to facilitate loop iterator.
This is a significant crash root cause with a multithreaded executor where a thread is able to add new subscription to the callback group.
The crash is caused by a concurrent modification of the weak pointer vector while another thread is iterating over that same vector resulting in a crash.

3. Mutually exclusive callback group hangs
  The root cause for this issue is due to a combination between the multithreaded executor and the mutually exclusive callback group used for all the ROS topics.
When the executor collects all the references to the subscriptions, timers and services, it skips the mutually exclusive callback_groups which are currently locked (ie: being processed by another thread).
This cause the resulting waitset to only contain the guard pointers.
If there is no activity on those guards, the thread will wait for work forever in the get_next_executable and block all other threads.
The resolution is to simply add a timeout for the multithreaded get_next_executable call ensuring the calling thread will eventually return

Testing:
  These changes where implemented and tested using a test application which creates / publish / deletes thousands of topics (up to 100,000) from a separate standalone thread while the ROS2 layer is receiving data on the topics deleted.
The muiltithreaded was setup to contain 10 separate executor threads on a single mutually exclusive callback group containing thousands of topics.

issue: ros2#813
Signed-off-by: Guillaume Autran <gautran@clearpath.ai>
@ivanpauno
Copy link
Member

Closed by #814.
Follow-up issue: #832

nnmm pushed a commit to ApexAI/rclcpp that referenced this issue Jul 9, 2022
* Remove a bunch of redundant nullptr assignments
* Fix some memory leaks
* Ensure source string is deallocated
* Remove redundant array size assignment

Signed-off-by: Scott K Logan <logans@cottsay.net>
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
bug Something isn't working
Projects
None yet
Development

No branches or pull requests

3 participants