Early (incorrect) exit in certain situations with mcap logging #1542
Comments
With a quick code scan, it seems to be a racy data condition between threads touching the storage. Backtrace from the aborting thread:

```
Thread 1 (Thread 0x7f86d34ea640 (LWP 118202)):
#0  __pthread_kill_implementation (no_tid=0, signo=6, threadid=140217047492160) at ./nptl/pthread_kill.c:44
#1  __pthread_kill_internal (signo=6, threadid=140217047492160) at ./nptl/pthread_kill.c:78
#2  __GI___pthread_kill (threadid=140217047492160, signo=signo@entry=6) at ./nptl/pthread_kill.c:89
#3  0x00007f86e0736476 in __GI_raise (sig=sig@entry=6) at ../sysdeps/posix/raise.c:26
#4  0x00007f86e071c7f3 in __GI_abort () at ./stdlib/abort.c:79
#5  0x00007f86df176bbe in ?? () from /lib/x86_64-linux-gnu/libstdc++.so.6
#6  0x00007f86df18224c in ?? () from /lib/x86_64-linux-gnu/libstdc++.so.6
#7  0x00007f86df1822b7 in std::terminate() () from /lib/x86_64-linux-gnu/libstdc++.so.6
#8  0x00007f86df182518 in __cxa_throw () from /lib/x86_64-linux-gnu/libstdc++.so.6
#9  0x00007f86d3755b23 in rosbag2_storage_plugins::MCAPStorage::write(std::shared_ptr<rosbag2_storage::SerializedBagMessage const>) [clone .cold] () from /home/ubuntu/jenkins-root/workspace/motion-sim-dev/install/lib/librosbag2_storage_mcap.so
#10 0x00007f86d3758e6c in rosbag2_storage_plugins::MCAPStorage::write(std::vector<std::shared_ptr<rosbag2_storage::SerializedBagMessage const>, std::allocator<std::shared_ptr<rosbag2_storage::SerializedBagMessage const> > > const&) () from /home/ubuntu/jenkins-root/workspace/motion-sim-dev/install/lib/librosbag2_storage_mcap.so
#11 0x00007f86deea7002 in rosbag2_cpp::writers::SequentialWriter::write_messages(std::vector<std::shared_ptr<rosbag2_storage::SerializedBagMessage const>, std::allocator<std::shared_ptr<rosbag2_storage::SerializedBagMessage const> > > const&) () from /opt/ros/humble/lib/librosbag2_cpp.so
#12 0x00007f86dee7a7f0 in rosbag2_cpp::cache::CacheConsumer::exec_consuming() () from /opt/ros/humble/lib/librosbag2_cpp.so
#13 0x00007f86df1b02b3 in ?? () from /lib/x86_64-linux-gnu/libstdc++.so.6
#14 0x00007f86e0788ac3 in start_thread (arg=<optimized out>) at ./nptl/pthread_create.c:442
#15 0x00007f86e081a850 in clone3 () at ../sysdeps/unix/sysv/linux/x86_64/clone3.S:81
```

I see that currently you use …
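For reference, the hazard being described reduces to a minimal standalone C++ program (the names here are illustrative, not from rosbag2): one thread inserting into a shared `std::unordered_map` while another iterates it is undefined behavior, and under load it can surface as aborts inside libstdc++ like the trace above.

```cpp
#include <string>
#include <thread>
#include <unordered_map>

// Illustrative stand-in for the storage plugin's topic bookkeeping.
std::unordered_map<std::string, int> topics;

int main()
{
  // "Discovery" thread: keeps adding topics (inserts may trigger rehashing).
  std::thread discovery([] {
    for (int i = 0; i < 100000; ++i) {
      topics.emplace("topic_" + std::to_string(i), i);
    }
  });

  // "Writer" thread: concurrently walks the map, roughly as write() does.
  std::thread writer([] {
    long checksum = 0;
    for (int i = 0; i < 1000; ++i) {
      for (const auto & entry : topics) {  // UB: iterating while the other thread mutates
        checksum += entry.second;
      }
    }
    (void)checksum;
  });

  discovery.join();
  writer.join();
}
```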
@anrp-tri Can you please clarify if you are using …; this is unclear.

We run …
I can attempt this tomorrow, and would expect it to mask any issues, but generally that won't work for our use case: we use `-a` plus a topic blacklist for recording, to make sure we get all possible data.
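(For reference, a recording invocation of that shape on Humble would look something like `ros2 bag record -s mcap -a -x '<exclude-regex>'`, with `-a` subscribing to every discovered topic and `-x` taking an exclusion regex; the actual pattern used in this CI is not shown here.)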
@fujitatomoya Your guess about … seems not true. We have made sure that this should not happen, because we create the subscription only after discovering the topic:

rosbag2/rosbag2_transport/src/rosbag2_transport/recorder.cpp, lines 347 to 365 at 0b4deb3
What could be happening is that we are discovering the same topic twice and removing the topic from the writer in the `else` statement.
@anrp-tri I would recommend trying to comment out the two lines that remove the topic from the writer, adding some logging to see if this is the case, and then trying to reproduce the issue.
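A sketch of what that diagnostic could look like in `Recorder::subscribe_topic()` (the log wording and placement are illustrative, not a proposed patch; the surrounding code is quoted in full further down):

```cpp
if (subscription) {
  subscriptions_.insert({topic.name, subscription});
} else {
  // Diagnostic build: log instead of silently un-registering the topic.
  RCLCPP_WARN_STREAM(
    this->get_logger(),
    "Failed to create subscription for topic '" << topic.name <<
      "'; leaving it registered in the writer for now");
  // writer_->remove_topic(topic);      // commented out per the suggestion above
  // subscriptions_.erase(topic.name);  // commented out per the suggestion above
}
```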
Unrelated to the above comments, a nightly run showed a proper segfault in the write function, calling into …
The only possible situation for that crash is if someone were to call …
@anrp-tri Or, even better, also wrap the code responsible for creating a new subscription in a check for whether such a subscription already exists, e.g.:

```cpp
void Recorder::subscribe_topic(const rosbag2_storage::TopicMetadata & topic)
{
  if (subscriptions_.find(topic.name) == subscriptions_.end()) {
    // Need to create the topic in the writer before trying to create the
    // subscription, since the subscription callback calls
    // writer_->write(bag_message); and that callback could fire before we
    // reach the line writer_->create_topic(topic).
    writer_->create_topic(topic);
    Rosbag2QoS subscription_qos{subscription_qos_for_topic(topic.name)};
    auto subscription = create_subscription(topic.name, topic.type, subscription_qos);
    if (subscription) {
      subscriptions_.insert({topic.name, subscription});
      RCLCPP_INFO_STREAM(
        this->get_logger(),
        "Subscribed to topic '" << topic.name << "'");
    } else {
      writer_->remove_topic(topic);
      subscriptions_.erase(topic.name);
    }
  }
}
```
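The outer check makes `subscribe_topic()` idempotent: a second discovery event for the same topic name becomes a no-op, instead of re-creating the topic in the writer and, on a failed subscription attempt, removing a topic that an earlier, successful subscription is still writing to.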
Sorry to take some time to do this; I've enabled core dumps and system library collection. To be clear, we don't build rosbag2 (for Humble), we use the system-provided one; we only build rosbag2_storage_mcap.
That might be a problem and cause undefined behavior due to the API/ABI incompatibility between different versions. |
We've experienced the same issue in our CI pipelines, specifically the 'Channel reference not found' error: …
We also run Humble on Ubuntu 22.04, but we build all of the packages in rosbag2 (commit 21a09c5) and use the standard `ros2 bag record` CLI interface. I've tried modifying our branch with the if-statement suggested by @MichaelOrlov, with an added log message, to hopefully catch the bug in action. We have a similar reproduction rate of around 1/50, so I'll report back once I know more.
Sorry for being late to get back to this. Here is what we see:

```
2024-01-15 13:49:52,859 ERROR:[INFO] [1705326592.534510178] [rosbag2_recorder]: Subscribed to topic '/carla/objects'
2024-01-15 13:49:52,859 ERROR:terminate called after throwing an instance of 'std::runtime_error'
2024-01-15 13:49:52,859 ERROR:what(): Channel reference not found for topic: "/ic/speedometer"
```

The subscription and the crash happen at the same time on different topics; I still think this is a race condition problem.
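For context, that message comes from the MCAP storage plugin's write path, which (paraphrasing the Humble-era source; member names here are illustrative) looks roughly like:

```cpp
void MCAPStorage::write(std::shared_ptr<const rosbag2_storage::SerializedBagMessage> msg)
{
  // Look up the MCAP channel that create_topic() should have registered earlier.
  const auto it = channel_ids_.find(msg->topic_name);
  if (it == channel_ids_.end()) {
    // A create_topic()/remove_topic() racing with this lookup lands us here.
    throw std::runtime_error{
      "Channel reference not found for topic: \"" + msg->topic_name + "\""};
  }
  // ... build an mcap::Message using it->second as the channel id and write it ...
}
```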
I did not mean a racy condition between …; I meant that, according to https://en.cppreference.com/w/cpp/container#Thread_safety, these container accesses are not thread-safe without external synchronization. A proposal is available here: #1561.
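The general shape of such a fix, as a minimal self-contained sketch (this is not the actual #1561 patch; the class and member names are invented for illustration):

```cpp
#include <mutex>
#include <string>
#include <unordered_map>

// Sketch: every access to the shared topic bookkeeping goes through one mutex.
class TopicRegistry
{
public:
  void add(const std::string & name, int channel_id)
  {
    std::lock_guard<std::mutex> lock{mutex_};
    channels_[name] = channel_id;
  }

  bool find(const std::string & name, int & channel_id) const
  {
    std::lock_guard<std::mutex> lock{mutex_};
    const auto it = channels_.find(name);
    if (it == channels_.end()) {
      return false;
    }
    channel_id = it->second;
    return true;
  }

private:
  mutable std::mutex mutex_;
  std::unordered_map<std::string, int> channels_;
};
```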
@anrp-tri @R-Stokke If you could try #1561 to see if it solves this problem, that would be really appreciated. CC: @MichaelOrlov
@fujitatomoya When cherry-picking your commit I realized that the humble branch is missing a sanity check that was added to rolling a while back, where the remove_topic function does nothing if a subscription to that topic does not exist, very similar to the proposed fix from @MichaelOrlov: https://github.com/ros2/rosbag2/blame/b8a9b06c983898da17898bd9248e38bd2cd6a893/rosbag2_storage_mcap/src/mcap_storage.cpp#L843. Not sure if this is the culprit, but I thought I'd flag it. I've run around 700 tests so far and have not been able to reproduce; I'll continue with the mutex protections.
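In spirit, the missing guard is just an existence check before un-registering (a paraphrase, not the verbatim rolling code; see the linked blame for the real thing):

```cpp
void MCAPStorage::remove_topic(const rosbag2_storage::TopicMetadata & topic)
{
  // Removing a topic that was never (or is no longer) registered becomes a
  // no-op instead of corrupting the channel bookkeeping.
  if (topics_.find(topic.name) != topics_.end()) {
    topics_.erase(topic.name);
    // ... the matching schema/channel entries are dropped here as well ...
  }
}
```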
@fujitatomoya It is unclear to me why we should protect the unordered_map with a mutex.
Ok, it seems I found out why: …
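(For the record: per the cppreference page linked above, member functions of a standard container may only be called concurrently if all of them are const. An insert into a `std::unordered_map` can trigger a rehash, so even a concurrent `find()` on an unrelated key constitutes a data race and undefined behavior, which is exactly the create_topic()-vs-write() pattern suspected here.)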
@MichaelOrlov Thanks for sharing that. I was originally thinking that reading or modifying the elements should be fine without mutex access, but …
@R-Stokke Just FYI, @MichaelOrlov and I came to the conclusion to change the locking mechanism a bit. If you have not done the verification yet, please try the latest change in #1561. Thanks in advance.
Description
We run the ros2 bagger in CI to collect logs of a (CI) simulation run. In this instance, we segregate parallel simulations with different ROS 2 domain IDs. Each individual simulation has on the order of 50 topics. Because of this parallelism, the machine is significantly loaded, i.e. scheduling delays are expected. About 2% of the time, the logger exits during system startup complaining about topic information.
Expected Behavior
The logger does not exit or crash.
Actual Behavior
We occasionally get one of three exits during startup while topics are appearing (the actual topic in question is not consistent); some logs:

…
These are both uncaught `std::runtime_error` exceptions, so the logger exits. Very occasionally, we instead get an error (presumably) from the STL which, combined with the stack traces, smells like memory corruption:

…
To Reproduce
I was unable to reproduce this with a test program that spammed new topics. We see this quite reliably in our CI (about 1/50 runs of a simulation). We can instrument the built version of rosbag2_storage_mcap if useful.
System
Additional context
I turned on cores and stack traces but haven't been able to obtain a core file from this yet; however, I have gotten some stacks that seem to point at concurrent access to some MCAPStorage instance variables (specifically, it looks like a from-Python call is creating a topic at the same time the writer is reading through the topics_ variable). See the attached file.
traceback.txt