Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Drop custom implementation of boost::condition_variable to fix busy-wait spinning #1878

Merged
merged 13 commits into from Feb 8, 2020

Conversation

meyerj
Copy link
Contributor

@meyerj meyerj commented Feb 5, 2020

An extension of #1651, which addresses #1343 (Regression in 1.12.13: rosout node spins rather than blocking and waiting, using 100% CPU):

Summary:

The patch fixes an ODR violation causing undefined behavior, which - depending on the compiler flags for roscpp and compiler and linker internals - can make nodes either spin at 100% CPU or block indefinitely waiting on a boost::condition_variable. This happens if the symbol of the boost::condition_variable constructor is linked from a compilation unit where BOOST_THREAD_HAS_CONDATTR_SET_CLOCK_MONOTONIC has been defined, and symbols of member functions are linked from another compilation unit where BOOST_THREAD_HAS_CONDATTR_SET_CLOCK_MONOTONIC has not been defined, or vice-versa.

Actually this macro is meant as a build-time setting in Boost (in boost/thread/detail/config.hpp) and defining it in user packages is always dangerous. It is important to never expose boost::condition_variable instances (or other Boost Thread types influenced by that flag?) in the public API, and to make sure that the constructor and member functions are only called internally by roscpp compilation units, where the BOOST_THREAD_HAS_CONDATTR_SET_CLOCK_MONOTONIC is defined. Strictly speaking, it is still an ODR violation because the macro effectively changes the implementation of the same symbols seen by different compilation units, and the patch only works because compilers tend to inline the relevant functions.

This problem will be only solved by Boost 1.67 or above, where BOOST_THREAD_HAS_CONDATTR_SET_CLOCK_MONOTONIC (renamed to BOOST_THREAD_INTERNAL_CLOCK_IS_MONO) became the default on platforms that support it (boostorg/thread@1e84b97).

Extension to #1651:

As requested in #1651 (comment) I now completely removed the custom implementation of boost::condition_variable copied from Boost 1.61:

So instead of changing the namespace and selecting which implementation of boost::condition_variable to select depending on the system Boost version it would be straight-forward to drop the alternative implementation as a whole, as long as all target platforms have Boost 1.61 or above.

REP 3 specifies the targeted platforms and for Melodic that is:

  • Ubuntu Artful (which is already EOL and can therefore be ignored),
  • Ubuntu Bionic (coming with Boost 1.65.1),
  • Debian Stretch (coming with Boost 1.62), and
  • Fedora 28 (which we don't build binaries for and can therefore also ignore).

Since that means all relevant platforms have Boost 1.61 or above we should do exactly that for the melodic-devel branch. Can you (or someone) please create a pull request with that proposed change?

meyerj and others added 9 commits February 5, 2020 09:11
…tion_variable (fix ros#1343)

ros#1014 and ros#1250 introduced a backported
version of boost::condition_variable, where support for steady (monotonic) clocks has been added in version 1.61.
But the namespace of the backported version was not changed and the symbol names might clash with the original
version.

Because the underlying clock used for the condition_variable is set in the constructor and must be
consistent with the the expectations within member variables. The compiler might choose to inline one or the
other or both, and is more likely to do so for optimized Release builds. But if it does not, the symbol ends
up in the symbol table of roscpp and depending on which other libraries will be linked into the process it
is unpredictable which of the two versions will be actually called at the end. In case the constructor defined
in `/usr/include/boost/thread/pthread/condition_variable.hpp` was called and did not configure the internal
pthread condition variable for monotonic clock, each call to the backported do_wait_until() method with a
monotonic timestamp will return immediately and hence causes `CallbackQueue::callOne(timeout)` or
`CallbackQueue::callAvailable(timeout)` to return immediately.

This patch changes the namespace of the backported condition_variable implementation to boost_161. This
removes the ambiguity with the original definition if both are used in the same process.
…ed timed_wait()

This fixes ROS timers in combination with 2c18b9f. The timer
callbacks were not called because the TimerManager's thread function blocked indefinitely on
boost::condition_variable::timed_wait().

Relative timed_wait() uses the system clock (boost::get_system_time()) unconditionally to
calculate the absolute timestamp for do_wait_until(). If the condition variable has been
initialized with BOOST_THREAD_HAS_CONDATTR_SET_CLOCK_MONOTONIC, it compares this timestamp
with the monotonic clock and therefore blocks.

This issue has been reported in https://svn.boost.org/trac10/ticket/12728 and will not be
fixed. The timed_wait interface is apparently deprecated.
…eadFunc() in steady_timer.cpp

The updated generic definition in timer_manager.h should do the same with a minor update.
In all cases we can call boost::condition_variable::wait_until() with an absolute time_point of the respective clock.
The conversion from system_clock to steady_clock for Time and WallTime is done internally in
boost::condition_variable::wait_until(lock_type& lock, const chrono::time_point<Clock, Duration>& t).
find_package(Boost) has to come before checking the Boost version.
Otherwise BOOST_THREAD_HAS_CONDATTR_SET_CLOCK_MONOTONIC was not defined which
triggered the assertion in timer_manager.h:240.

Since Boost 1.67 BOOST_THREAD_HAS_CONDATTR_SET_CLOCK_MONOTONIC became the default
if the platform supports it and the macro is not defined anymore. Instead, check
for BOOST_THREAD_INTERNAL_CLOCK_IS_MONO.
…ITION_VARIABLE_HEADER macros by a typedef in internal_condition_variable.h
@meyerj
Copy link
Contributor Author

meyerj commented Feb 5, 2020

ABI-compatibility report: compat_report-fix-1343-melodic-devel-2.html

100% compatibility according to abi-compliance-checker (on Ubuntu 18.04 amd64, with gcc 7.4.0). Only symbols exposed in public headers have been considered - as recommended on that page.

@flixr
Copy link
Contributor

flixr commented Feb 6, 2020

👍 for removing the backported boost headers on melodic and up.

@ahoarau
Copy link
Contributor

ahoarau commented Feb 6, 2020

@meyerj Still I have all my CPUs going crazy if I build with clang++-6 and c++14 enabled , on Ubuntu 18.04 with Boost 1.65.1.

@meyerj
Copy link
Contributor Author

meyerj commented Feb 6, 2020

@meyerj Still I have all my CPUs going crazy if I build with clang++-6 and c++14 enabled , on Ubuntu 18.04 with Boost 1.65.1.

It is still possible that the symbols boost::condition_variable() and boost::condition_variable::wait_for() in libroscpp.so, which are compiled with BOOST_THREAD_HAS_CONDATTR_SET_CLOCK_MONOTONIC being defined, interfere with symbols of the same name generated in other compilation units where the macro has not been defined. So at the end the behavior depends on the linking order and compiler flags for libroscpp.so and other linked libraries (see description of this PR).

I have not tested this idea yet, but one option to get rid of the ODR violation - at least when building shared libraries - is to compile roscpp with hidden symbols (<LANG>_VISIBILITY_PRESET or VISIBILITY_INLINES_HIDDEN cmake properties). Then the Boost symbols and other internally used symbols will not be exposed in the dynamic symbol table and therefore cannot affect other libraries' behavior. This approach requires that all public symbols are exported explicitly using the ROSCPP_DECL macro, but because roscpp is ready for Windows where symbols are hidden by default, it could also work for Linux.

@meyerj
Copy link
Contributor Author

meyerj commented Feb 6, 2020

To be clear, replacing wait_until(sleep_end) with wait_for(sleep_end - current) is not exactly the same.
But I guess for how these Timers are supposed to be used it's ok.

I agree. If you check how boost::condition_variable::wait_for() is implemented for the case BOOST_THREAD_HAS_CONDATTR_SET_CLOCK_MONOTONIC is defined, it basically calls steady_clock::now() again and then forwards to wait_until() with the given duration added to the steady clock stamp. For wait_until() and an argument already given in steady-clock time (for the SteadyTime instantiation) it directly calls the pthread implementation without an extra call to now(). For WallTime the two alternatives are mostly equivalent, with a slight advantage for boost::condition_variable::wait_for() because it does not require an extra call to Clock::now().

@meyerj
Copy link
Contributor Author

meyerj commented Feb 6, 2020

I have not tested this idea yet, but one option to get rid of the ODR violation - at least when building shared libraries - is to compile roscpp with hidden symbols (<LANG>_VISIBILITY_PRESET or VISIBILITY_INLINES_HIDDEN cmake properties).

abi-compliance-checker is not happy with a libroscpp.so binary compiled with one of these options enabled. Many symbols are missing compared to the current version: compat_report-fix-1343-melodic-devel-2-2a67cf67-vibility-preset-hidden.html

All those symbols are defined in a roscpp header or implicitly by the compiler, so normally derived packages should not depend on them and compile their own versions instead. This check has been done by comparing debug versions of the libroscpp.so binaries. In release mode the compiler would probably have inlined some of them and they would also not appear in the symbol table either.

The good thing is that also all boost::condition_variable-related symbols disappear from the dynamic symbol table and hence there is no risk that the versions affected by BOOST_THREAD_HAS_CONDATTR_SET_CLOCK_MONOTONIC are linked elsewhere.

@dirk-thomas
Copy link
Member

@ahoarau Can you please double check this patch again? Several others have reported in the past that it fixes the problem for them. If it doesn't for you can please share the complete and exact steps you did to try it.

@dirk-thomas
Copy link
Member

I have convinced myself that the current patch fixes the problem for Melodic (and does the right thing by getting rid of the copied Boost files).

  • Using the following Dockerfile I confirmed the CPU usage is high in vanilla Kinetic:
Dockerfile - vanilla Kinetic: FROM ros:kinetic-ros-base-xenial

RUN mkdir -p /catkin_ws/src &&
cd /catkin_ws/src &&
git clone https://github.com/ros/ros_comm.git -b kinetic-devel &&
git clone https://github.com/ros/roscpp_core.git -b kinetic-devel

RUN cd /catkin_ws &&
/ros_entrypoint.sh catkin_make_isolated

RUN echo '#!/bin/bash\n
set -e\n
source "/catkin_ws/devel_isolated/setup.bash"\n
exec "$@"\n' > /devel_entrypoint.sh && chmod a+x /devel_entrypoint.sh

ENTRYPOINT ["/devel_entrypoint.sh"]
CMD ["roscore"]

  • Using the following Dockerfile I confirmed the CPU usage is high in vanilla Melodic:
Dockerfile - vanilla Melodic: FROM ros:melodic-ros-base-bionic

RUN mkdir -p /catkin_ws/src &&
cd /catkin_ws/src &&
git clone https://github.com/ros/ros_comm.git -b melodic-devel &&
git clone https://github.com/ros/roscpp_core.git -b kinetic-devel

RUN cd /catkin_ws &&
/ros_entrypoint.sh catkin_make_isolated

RUN echo '#!/bin/bash\n
set -e\n
source "/catkin_ws/devel_isolated/setup.bash"\n
exec "$@"\n' > /devel_entrypoint.sh && chmod a+x /devel_entrypoint.sh

ENTRYPOINT ["/devel_entrypoint.sh"]
CMD ["roscore"]

  • Using the following Dockerfile I confirmed the CPU usage is not high anymore using this patch in Melodic:
Dockerfile - patched Melodic: FROM ros:melodic-ros-base-bionic

RUN mkdir -p /catkin_ws/src &&
cd /catkin_ws/src &&
git clone https://github.com/meyerj/ros_comm.git -b fix-1343-melodic-devel-2 &&
git clone https://github.com/ros/roscpp_core.git -b kinetic-devel

RUN cd /catkin_ws &&
/ros_entrypoint.sh catkin_make_isolated

RUN echo '#!/bin/bash\n
set -e\n
source "/catkin_ws/devel_isolated/setup.bash"\n
exec "$@"\n' > /devel_entrypoint.sh && chmod a+x /devel_entrypoint.sh

ENTRYPOINT ["/devel_entrypoint.sh"]
CMD ["roscore"]

I will go ahead and merge this patch by the end of the day (if there are no objections). Thank you to everyone who has contributed to this - especially @meyerj for creating and iterating on the patch.

@matthew-reynolds
Copy link

Several others have reported in the past that it fixes the problem for them

I was one of those people, but I only compiled using GCC, so I just re-ran your two Melodic Dockerfiles (Didn't bother w/ Kinetic) using clang-6.0/clang++-6.0.

I got the same behaviour with clang as you did with gcc:

  • Without the patch, very high CPU usage.
  • With the patch, near 0% CPU usage.

The Dockerfile is using the Ubuntu Bionic defaults: clang-6.0, boost 1.65.1. Perhaps if @ahoarau is still seeing poor behaviour there's something else at play than just a different compiler.

@dirk-thomas
Copy link
Member

Moving forward and merging this. Please feel free to continue commenting on the closed ticket.

@dirk-thomas dirk-thomas merged commit d80c914 into ros:melodic-devel Feb 8, 2020
@dirk-thomas
Copy link
Member

Special thanks to @meyerj for working on multiple iterations of this important fix as well as everyone else who helped with this.

@meyerj meyerj deleted the fix-1343-melodic-devel-2 branch February 20, 2020 22:50
meyerj added a commit to meyerj/ros_comm that referenced this pull request Jul 31, 2020
…ait spinning (ros#1878)

* roscpp: fix potential busy-wait loop caused by backported Boost condition_variable (fix ros#1343)

ros#1014 and ros#1250 introduced a backported
version of boost::condition_variable, where support for steady (monotonic) clocks has been added in version 1.61.
But the namespace of the backported version was not changed and the symbol names might clash with the original
version.

Because the underlying clock used for the condition_variable is set in the constructor and must be
consistent with the the expectations within member variables. The compiler might choose to inline one or the
other or both, and is more likely to do so for optimized Release builds. But if it does not, the symbol ends
up in the symbol table of roscpp and depending on which other libraries will be linked into the process it
is unpredictable which of the two versions will be actually called at the end. In case the constructor defined
in `/usr/include/boost/thread/pthread/condition_variable.hpp` was called and did not configure the internal
pthread condition variable for monotonic clock, each call to the backported do_wait_until() method with a
monotonic timestamp will return immediately and hence causes `CallbackQueue::callOne(timeout)` or
`CallbackQueue::callAvailable(timeout)` to return immediately.

This patch changes the namespace of the backported condition_variable implementation to boost_161. This
removes the ambiguity with the original definition if both are used in the same process.

* roscpp: use boost::condition_variable::wait_for() instead of deprecated timed_wait()

This fixes ROS timers in combination with 2c18b9f. The timer
callbacks were not called because the TimerManager's thread function blocked indefinitely on
boost::condition_variable::timed_wait().

Relative timed_wait() uses the system clock (boost::get_system_time()) unconditionally to
calculate the absolute timestamp for do_wait_until(). If the condition variable has been
initialized with BOOST_THREAD_HAS_CONDATTR_SET_CLOCK_MONOTONIC, it compares this timestamp
with the monotonic clock and therefore blocks.

This issue has been reported in https://svn.boost.org/trac10/ticket/12728 and will not be
fixed. The timed_wait interface is apparently deprecated.

* roscpp: do not use boost_161_condition_variable.h on Windows (untested)

* roscpp: remove specialized implementation of TimerManager<T,D,E>::threadFunc() in steady_timer.cpp

The updated generic definition in timer_manager.h should do the same with a minor update.
In all cases we can call boost::condition_variable::wait_until() with an absolute time_point of the respective clock.
The conversion from system_clock to steady_clock for Time and WallTime is done internally in
boost::condition_variable::wait_until(lock_type& lock, const chrono::time_point<Clock, Duration>& t).

* fix namespaces

* add more explicit namespaces

* add missing ns

* roscpp: fixed Boost version check in CMakeLists.txt

find_package(Boost) has to come before checking the Boost version.
Otherwise BOOST_THREAD_HAS_CONDATTR_SET_CLOCK_MONOTONIC was not defined which
triggered the assertion in timer_manager.h:240.

Since Boost 1.67 BOOST_THREAD_HAS_CONDATTR_SET_CLOCK_MONOTONIC became the default
if the platform supports it and the macro is not defined anymore. Instead, check
for BOOST_THREAD_INTERNAL_CLOCK_IS_MONO.

* roscpp: replace ROSCPP_BOOST_CONDITION_VARIABLE and ROSCPP_BOOST_CONDITION_VARIABLE_HEADER macros by a typedef in internal_condition_variable.h

* Remove copy of boost::condition_variable implementation from Boost 1.61 in namespace boost_161

* Revert some changes in include directives and in CMakeLists.txt to minimize the diff to melodic-devel

Addresses ros#1878 (review).

* use wait_for(), remove TimerManagerTraits

* Revert "use wait_for(), remove TimerManagerTraits"

This reverts commit 2a67cf6.

Co-authored-by: Antoine Hoarau <703240+ahoarau@users.noreply.github.com>
Co-authored-by: Dirk Thomas <dirk-thomas@users.noreply.github.com>
dirk-thomas added a commit that referenced this pull request Aug 3, 2020
…tonic clock [kinetic-devel] (#2011)

* Drop custom implementation of boost::condition_variable to fix busy-wait spinning (#1878)

* roscpp: fix potential busy-wait loop caused by backported Boost condition_variable (fix #1343)

#1014 and #1250 introduced a backported
version of boost::condition_variable, where support for steady (monotonic) clocks has been added in version 1.61.
But the namespace of the backported version was not changed and the symbol names might clash with the original
version.

Because the underlying clock used for the condition_variable is set in the constructor and must be
consistent with the the expectations within member variables. The compiler might choose to inline one or the
other or both, and is more likely to do so for optimized Release builds. But if it does not, the symbol ends
up in the symbol table of roscpp and depending on which other libraries will be linked into the process it
is unpredictable which of the two versions will be actually called at the end. In case the constructor defined
in `/usr/include/boost/thread/pthread/condition_variable.hpp` was called and did not configure the internal
pthread condition variable for monotonic clock, each call to the backported do_wait_until() method with a
monotonic timestamp will return immediately and hence causes `CallbackQueue::callOne(timeout)` or
`CallbackQueue::callAvailable(timeout)` to return immediately.

This patch changes the namespace of the backported condition_variable implementation to boost_161. This
removes the ambiguity with the original definition if both are used in the same process.

* roscpp: use boost::condition_variable::wait_for() instead of deprecated timed_wait()

This fixes ROS timers in combination with 2c18b9f. The timer
callbacks were not called because the TimerManager's thread function blocked indefinitely on
boost::condition_variable::timed_wait().

Relative timed_wait() uses the system clock (boost::get_system_time()) unconditionally to
calculate the absolute timestamp for do_wait_until(). If the condition variable has been
initialized with BOOST_THREAD_HAS_CONDATTR_SET_CLOCK_MONOTONIC, it compares this timestamp
with the monotonic clock and therefore blocks.

This issue has been reported in https://svn.boost.org/trac10/ticket/12728 and will not be
fixed. The timed_wait interface is apparently deprecated.

* roscpp: do not use boost_161_condition_variable.h on Windows (untested)

* roscpp: remove specialized implementation of TimerManager<T,D,E>::threadFunc() in steady_timer.cpp

The updated generic definition in timer_manager.h should do the same with a minor update.
In all cases we can call boost::condition_variable::wait_until() with an absolute time_point of the respective clock.
The conversion from system_clock to steady_clock for Time and WallTime is done internally in
boost::condition_variable::wait_until(lock_type& lock, const chrono::time_point<Clock, Duration>& t).

* fix namespaces

* add more explicit namespaces

* add missing ns

* roscpp: fixed Boost version check in CMakeLists.txt

find_package(Boost) has to come before checking the Boost version.
Otherwise BOOST_THREAD_HAS_CONDATTR_SET_CLOCK_MONOTONIC was not defined which
triggered the assertion in timer_manager.h:240.

Since Boost 1.67 BOOST_THREAD_HAS_CONDATTR_SET_CLOCK_MONOTONIC became the default
if the platform supports it and the macro is not defined anymore. Instead, check
for BOOST_THREAD_INTERNAL_CLOCK_IS_MONO.

* roscpp: replace ROSCPP_BOOST_CONDITION_VARIABLE and ROSCPP_BOOST_CONDITION_VARIABLE_HEADER macros by a typedef in internal_condition_variable.h

* Remove copy of boost::condition_variable implementation from Boost 1.61 in namespace boost_161

* Revert some changes in include directives and in CMakeLists.txt to minimize the diff to melodic-devel

Addresses #1878 (review).

* use wait_for(), remove TimerManagerTraits

* Revert "use wait_for(), remove TimerManagerTraits"

This reverts commit 2a67cf6.

Co-authored-by: Antoine Hoarau <703240+ahoarau@users.noreply.github.com>
Co-authored-by: Dirk Thomas <dirk-thomas@users.noreply.github.com>

* Use an internal implementation of boost::condition_variable with monotonic clock (#1932)

* Use an internal implementation of condition_variable with monotonic clock to avoid ODR violation

* Fix build of ros/internal/condition_variable.h for Boost <1.65

* Fix "static_assert with no message is a C++17 extension" warning in ros/internal/condition_variable.h

* roscpp: fix ros/internal/condition_variable.h for pre-C++11 compilers

Co-authored-by: Antoine Hoarau <703240+ahoarau@users.noreply.github.com>
Co-authored-by: Dirk Thomas <dirk-thomas@users.noreply.github.com>
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

Successfully merging this pull request may close these issues.

None yet

5 participants