Early (incorrect) exit in certain situations with mcap logging #1542

Closed
anrp-tri opened this issue Jan 17, 2024 · 19 comments
Labels
bug Something isn't working

Comments

@anrp-tri

anrp-tri commented Jan 17, 2024

Description

We run the ros2 bagger in CI to collect logs of a (CI) simulation run. In this setup we segregate parallel simulations with different ROS 2 domain IDs; each individual simulation has on the order of 50 topics. Because the simulations run in parallel, the machine is significantly loaded, i.e. scheduling delays are expected. About 2% of the time, the logger exits during system startup complaining about topic information.

Expected Behavior

The logger does not exit or crash.

Actual Behavior

Occasionally, during startup while topics are still appearing, we get one of three exits (the actual topic in question is not consistent). Some logs:

2024-01-15 13:49:52,859 ERROR:[INFO] [1705326592.534510178] [rosbag2_recorder]: Subscribed to topic '/carla/objects'
2024-01-15 13:49:52,859 ERROR:terminate called after throwing an instance of 'std::runtime_error'
2024-01-15 13:49:52,859 ERROR:what():  Channel reference not found for topic: "/ic/speedometer"
2024-01-12 13:29:40,012 ERROR:[INFO] [1705066179.755091747] [rosbag2_recorder]: Subscribed to topic '/ic/warnings/fuel'
2024-01-12 13:29:40,012 ERROR:terminate called after throwing an instance of 'std::runtime_error'
2024-01-12 13:29:40,012 ERROR:what():  Unknown message topic "/rosout"

These are both uncaught std::runtime_error exceptions, so the logger exits.
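For context, both messages come from topic lookups on the write path of the mcap storage plugin; the following is only a rough, illustrative sketch of that pattern, not the exact plugin code (the map names topics_ and channel_ids_ are the ones discussed later in this thread):

#include <cstdint>
#include <stdexcept>
#include <string>
#include <unordered_map>

// Illustrative sketch only, not the actual rosbag2_storage_mcap implementation.
struct WritePathSketch
{
  // Internal maps of the storage plugin (names as used in this thread).
  std::unordered_map<std::string, std::string> topics_;         // topic metadata in the real code
  std::unordered_map<std::string, std::uint16_t> channel_ids_;  // mcap::ChannelId in the real code

  void write(const std::string & topic)
  {
    // If the topic was never registered (or the maps are being mutated by
    // another thread), one of these lookups fails and the exception
    // propagates out of the recording thread uncaught.
    if (topics_.find(topic) == topics_.end()) {
      throw std::runtime_error("Unknown message topic \"" + topic + "\"");
    }
    if (channel_ids_.find(topic) == channel_ids_.end()) {
      throw std::runtime_error("Channel reference not found for topic: \"" + topic + "\"");
    }
    // ... otherwise the message is handed to the MCAP writer for this channel.
  }
};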

Very occasionally, we get an error (presumably from the STL) which, combined with the stack traces, smells like memory corruption:

2024-01-11 13:51:23,861 ERROR:[INFO] [1704981083.671525050] [rosbag2_recorder]: Subscribed to topic '/telemetry/roadwheel_angle'
2024-01-11 13:51:23,861 ERROR:terminate called after throwing an instance of 'std::bad_alloc'
2024-01-11 13:51:23,861 ERROR:what():  std::bad_alloc

To Reproduce

I was unable to reproduce this with a test program that spammed new topics. We see this quite reliably in our CI (about 1/50 runs of a simulation). We can instrument the built version of rosbag2_storage_mcap if useful.
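For reference, the kind of test program meant above looks roughly like this; an illustrative sketch with made-up node and topic names, run against ros2 bag record -a:

// Illustrative sketch of a "topic spammer" used to try to reproduce the crash.
// Node and topic names are made up.
#include <chrono>
#include <memory>
#include <string>
#include <vector>

#include <rclcpp/rclcpp.hpp>
#include <std_msgs/msg/string.hpp>

int main(int argc, char ** argv)
{
  rclcpp::init(argc, argv);
  auto node = std::make_shared<rclcpp::Node>("topic_spammer");
  std::vector<rclcpp::Publisher<std_msgs::msg::String>::SharedPtr> pubs;
  for (int i = 0; i < 500; ++i) {
    // Each new publisher creates a new topic the recorder has to discover and subscribe to.
    pubs.push_back(
      node->create_publisher<std_msgs::msg::String>("/spam/topic_" + std::to_string(i), 10));
    rclcpp::sleep_for(std::chrono::milliseconds(10));
  }
  rclcpp::spin(node);
  rclcpp::shutdown();
  return 0;
}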

System (please complete the following information)

  • OS: Ubuntu Jammy (also focal)
  • ROS 2 Distro: humble (also foxy)
  • Install Method: We build rosbag2_storage_mcap from source
  • Version: 0b4deb3

Additional context

I turned on core dumps and stack traces but haven't obtained a core file from this yet. However, I have gotten some stack traces that seem to point at concurrent access to MCAPStorage instance variables (specifically, it looks like a call from Python is creating a topic at the same time the writer is iterating over the topics_ variable). See the attached file.
traceback.txt

anrp-tri added the bug label Jan 17, 2024
@fujitatomoya
Contributor

fujitatomoya commented Jan 18, 2024

From a quick code scan, this looks like a data race between the storage writer thread and the discovery thread.

Thread 1 (Thread 0x7f86d34ea640 (LWP 118202)):
#0  __pthread_kill_implementation (no_tid=0, signo=6, threadid=140217047492160) at ./nptl/pthread_kill.c:44
#1  __pthread_kill_internal (signo=6, threadid=140217047492160) at ./nptl/pthread_kill.c:78
#2  __GI___pthread_kill (threadid=140217047492160, signo=signo@entry=6) at ./nptl/pthread_kill.c:89
#3  0x00007f86e0736476 in __GI_raise (sig=sig@entry=6) at ../sysdeps/posix/raise.c:26
#4  0x00007f86e071c7f3 in __GI_abort () at ./stdlib/abort.c:79
#5  0x00007f86df176bbe in ?? () from /lib/x86_64-linux-gnu/libstdc++.so.6
#6  0x00007f86df18224c in ?? () from /lib/x86_64-linux-gnu/libstdc++.so.6
#7  0x00007f86df1822b7 in std::terminate() () from /lib/x86_64-linux-gnu/libstdc++.so.6
#8  0x00007f86df182518 in __cxa_throw () from /lib/x86_64-linux-gnu/libstdc++.so.6
#9  0x00007f86d3755b23 in rosbag2_storage_plugins::MCAPStorage::write(std::shared_ptr<rosbag2_storage::SerializedBagMessage const>) [clone .cold] () from /home/ubuntu/jenkins-root/workspace/motion-sim-dev/install/lib/librosbag2_storage_mcap.so
#10 0x00007f86d3758e6c in rosbag2_storage_plugins::MCAPStorage::write(std::vector<std::shared_ptr<rosbag2_storage::SerializedBagMessage const>, std::allocator<std::shared_ptr<rosbag2_storage::SerializedBagMessage const> > > const&) () from /home/ubuntu/jenkins-root/workspace/motion-sim-dev/install/lib/librosbag2_storage_mcap.so
#11 0x00007f86deea7002 in rosbag2_cpp::writers::SequentialWriter::write_messages(std::vector<std::shared_ptr<rosbag2_storage::SerializedBagMessage const>, std::allocator<std::shared_ptr<rosbag2_storage::SerializedBagMessage const> > > const&) () from /opt/ros/humble/lib/librosbag2_cpp.so
#12 0x00007f86dee7a7f0 in rosbag2_cpp::cache::CacheConsumer::exec_consuming() () from /opt/ros/humble/lib/librosbag2_cpp.so
#13 0x00007f86df1b02b3 in ?? () from /lib/x86_64-linux-gnu/libstdc++.so.6
#14 0x00007f86e0788ac3 in start_thread (arg=<optimized out>) at ./nptl/pthread_create.c:442
#15 0x00007f86e081a850 in clone3 () at ../sysdeps/unix/sysv/linux/x86_64/clone3.S:81

I see that you currently use ros2 bag record -a -b 500000000 -d 300 -x; can you try adding --no-discovery to see if the same coredump can still be generated?

@MichaelOrlov
Contributor

@anrp-tri Can you please clarify whether you are using the ros2 bag record CLI command or building your own recorder on top of rosbag2_storage_mcap?
Because from the issue description:

Install Method: We build rosbag2_storage_mcap from source

this is unclear.

@anrp-tri
Author

anrp-tri commented Jan 18, 2024

@anrp-tri Can you please clarify whether you are using the ros2 bag record CLI command or building your own recorder on top of rosbag2_storage_mcap? Because from the issue description this is unclear.

We run ros2 bag record ..., not using a different frontend to it.

@anrp-tri
Author

I see that you currently use ros2 bag record -a -b 500000000 -d 300 -x; can you try adding --no-discovery to see if the same coredump can still be generated?

I can attempt this tomorrow (and would expect it to mask any issues), but generally that won't work for our use case, as we use -a plus a topic blacklist for recording to make sure we capture all possible data.

@MichaelOrlov
Contributor

@fujitatomoya Regarding your guess:

From a quick code scan, this looks like a data race between the storage writer thread and the discovery thread.

That seems not to be the case. We have made sure that this cannot happen, since we create the subscription only after discovering the topic:

void Recorder::subscribe_topic(const rosbag2_storage::TopicMetadata & topic)
{
  // Need to create topic in writer before we are trying to create subscription. Since in
  // callback for subscription we are calling writer_->write(bag_message); and it could happened
  // that callback called before we reached out the line: writer_->create_topic(topic)
  writer_->create_topic(topic);

  Rosbag2QoS subscription_qos{subscription_qos_for_topic(topic.name)};
  auto subscription = create_subscription(topic.name, topic.type, subscription_qos);
  if (subscription) {
    subscriptions_.insert({topic.name, subscription});
    RCLCPP_INFO_STREAM(
      this->get_logger(),
      "Subscribed to topic '" << topic.name << "'");
  } else {
    writer_->remove_topic(topic);
    subscriptions_.erase(topic.name);
  }
}

What could be happening is that we discover the same topic twice and remove the topic from the writer in the else branch:

 } else {
    writer_->remove_topic(topic);
    subscriptions_.erase(topic.name);
  }

@anrp-tri I would recommend trying to comment out those two lines that remove the topic from the writer, adding some logging to see whether this is the case, and trying to reproduce the issue.
Also, a coredump would be very helpful for further analysis.
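For illustration, that instrumentation could look roughly like this inside the else branch (a sketch, not a tested patch; the log wording is made up):

  } else {
    // Sketch: log instead of removing the topic, to see whether this branch
    // is ever reached for a topic that is already known to the writer.
    RCLCPP_WARN_STREAM(
      this->get_logger(),
      "create_subscription failed for topic '" << topic.name << "', keeping it in the writer");
    // writer_->remove_topic(topic);
    // subscriptions_.erase(topic.name);
  }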

@anrp-tri
Author

Unrelated to the above comments, a nightly run showed a proper segfault in the write function calling into std::_Hashtable<..., which I'm guessing is std::unordered_map<std::string, mcap::ChannelId> channel_ids_.
stacks2.txt

@MichaelOrlov
Contributor

@anrp-tri

Unrelated to the above comments, a nightly run showed a proper segfault in the write function calling into std::_Hashtable<..., which I'm guessing is std::unordered_map<std::string, mcap::ChannelId> channel_ids_.
stacks2.txt

The only possible situation for that crash is if someone calls writer_->remove_topic(topic); in parallel with writer_->write(messages);, either from the callback or from the cache consumer, which runs in a separate thread.
As I've mentioned above, try disabling the else branch containing:

    writer_->remove_topic(topic);
    subscriptions_.erase(topic.name);

@MichaelOrlov
Contributor

@anrp-tri Or, even better, also try wrapping the code responsible for creating a new subscription in a check for whether such a subscription already exists, e.g.:

void Recorder::subscribe_topic(const rosbag2_storage::TopicMetadata & topic)
{
  if (subscriptions_.find(topic.name) == subscriptions_.end()) {
    // Need to create topic in writer before we are trying to create subscription. Since in
    // callback for subscription we are calling writer_->write(bag_message); and it could happened
    // that callback called before we reached out the line: writer_->create_topic(topic)
    writer_->create_topic(topic);

    Rosbag2QoS subscription_qos{subscription_qos_for_topic(topic.name)};
    auto subscription = create_subscription(topic.name, topic.type, subscription_qos);
    if (subscription) {
      subscriptions_.insert({topic.name, subscription});
      RCLCPP_INFO_STREAM(
        this->get_logger(),
        "Subscribed to topic '" << topic.name << "'");
    } else {
      writer_->remove_topic(topic);
      subscriptions_.erase(topic.name);
    }
  }
}

@anrp-tri
Author

Sorry it's taking time to do this; I've enabled core dumps & system library collection. To be clear, we don't build rosbag2 (for humble), we use the system-provided one; we just build rosbag2_storage_mcap.

@MichaelOrlov
Contributor

@anrp-tri

Sorry it's taking time to do this; I've enabled core dumps & system library collection. To be clear, we don't build rosbag2 (for humble), we use the system-provided one; we just build rosbag2_storage_mcap.

That might be a problem and cause undefined behavior due to API/ABI incompatibility between different versions.
We guarantee API/ABI compatibility only within the scope of a single ROS 2 distro release.

@R-Stokke

We've experienced the same issue in our CI pipelines, specifically the 'Channel reference not found' error:

2024-01-15 13:49:52,859 ERROR:[INFO] [1705326592.534510178] [rosbag2_recorder]: Subscribed to topic '/carla/objects'
2024-01-15 13:49:52,859 ERROR:terminate called after throwing an instance of 'std::runtime_error'
2024-01-15 13:49:52,859 ERROR:what(): Channel reference not found for topic: "/ic/speedometer"

We also run Humble on Ubuntu 22.04, but we build all of the packages in rosbag2 (commit 21a09c5) and use the standard ros2 bag record ... CLI interface. I've tried modifying our branch with the if-statement suggested by @MichaelOrlov, plus an added log message, to hopefully catch the bug in action. We have a similar reproduction rate of around 1/50, so I'll report back once I know more.

@fujitatomoya
Contributor

fujitatomoya commented Feb 1, 2024

Sorry for being late to get back to this.

2024-01-15 13:49:52,859 ERROR:[INFO] [1705326592.534510178] [rosbag2_recorder]: Subscribed to topic '/carla/objects'
2024-01-15 13:49:52,859 ERROR:terminate called after throwing an instance of 'std::runtime_error'
2024-01-15 13:49:52,859 ERROR:what(): Channel reference not found for topic: "/ic/speedometer"

This happens at the same time as subscribing to a different topic; I still think this is a race condition problem.

That seems not to be the case. We have made sure that this cannot happen, since we create the subscription only after discovering the topic.

I did not mean a race between creating subscriptions and discovery; that part seems to be okay.

What I meant is concurrent read and write access to MCAPStorage::channel_ids_: the discovery thread writes to it via create_topic, while the CacheConsumer executor's thread reads it in MCAPStorage::write.

According to https://en.cppreference.com/w/cpp/container#Thread_safety, MCAPStorage::channel_ids_ needs to be protected by a mutex.

A proposal is available here: #1561.
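For readers following along, the general shape of that kind of fix is to serialize access to the shared map; a minimal, self-contained sketch of the idea (illustration only, not the actual #1561 diff):

// Minimal sketch of the locking idea: create_topic() runs on the discovery
// thread and write() on the cache-consumer thread, so both accesses to the
// shared map are serialized with one mutex. Illustration only.
#include <cstdint>
#include <mutex>
#include <stdexcept>
#include <string>
#include <unordered_map>

class StorageSketch
{
public:
  void create_topic(const std::string & topic, std::uint16_t channel_id)
  {
    std::lock_guard<std::mutex> lock(mutex_);
    channel_ids_.emplace(topic, channel_id);  // may trigger a rehash; now safe
  }

  std::uint16_t channel_for(const std::string & topic)
  {
    std::lock_guard<std::mutex> lock(mutex_);
    auto it = channel_ids_.find(topic);
    if (it == channel_ids_.end()) {
      throw std::runtime_error("Channel reference not found for topic: \"" + topic + "\"");
    }
    return it->second;
  }

private:
  std::mutex mutex_;
  std::unordered_map<std::string, std::uint16_t> channel_ids_;  // mcap::ChannelId in the real code
};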

@fujitatomoya
Contributor

@anrp-tri @R-Stokke If you could try #1561 to see whether it solves this problem, that would be really appreciated.

CC: @MichaelOrlov

@R-Stokke

R-Stokke commented Feb 1, 2024

@fujitatomoya, when cherry-picking your commit I realized that the humble branch is missing a sanity check that was added to rolling a while back, where the remove_topic function does nothing if a subscription to that topic does not exist, very similar to the fix proposed by @MichaelOrlov. Not sure if this is the culprit, but I thought I'd flag it.

I've run around 700 tests so far and have not been able to reproduce it; I will continue with the mutex protections. For reference, the check I mentioned: https://github.com/ros2/rosbag2/blame/b8a9b06c983898da17898bd9248e38bd2cd6a893/rosbag2_storage_mcap/src/mcap_storage.cpp#L843.

@MichaelOrlov
Contributor

@fujitatomoya It is unclear to me why we should protect the unordered_map with a mutex.
According to the "Iterator invalidation" section of https://en.cppreference.com/w/cpp/container, if new elements are inserted into an unordered_map, all iterators remain valid.
Could you please add more details if you are aware of any?

@MichaelOrlov
Contributor

OK, it seems I found out why.
It is all about unordered_map rehashing.
According to https://en.cppreference.com/w/cpp/container/unordered_map/emplace:

If after the operation the new number of elements is greater than old max_load_factor() * bucket_count() a rehashing takes place.
If rehashing occurs (due to the insertion), all iterators are invalidated. Otherwise (no rehashing), iterators are not invalidated.
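A small self-contained demo makes that concrete (illustration only, not recorder code); a concurrent, unsynchronized find() during an insert that rehashes is a data race:

// Demo: inserting into an unordered_map eventually triggers a rehash
// (the bucket count grows), which invalidates all iterators into the map.
#include <cstddef>
#include <iostream>
#include <string>
#include <unordered_map>

int main()
{
  std::unordered_map<std::string, int> m;
  std::size_t buckets = m.bucket_count();
  for (int i = 0; i < 100; ++i) {
    m.emplace("/topic_" + std::to_string(i), i);
    if (m.bucket_count() != buckets) {
      std::cout << "rehash after " << m.size() << " elements: "
                << buckets << " -> " << m.bucket_count() << " buckets\n";
      buckets = m.bucket_count();
    }
  }
  return 0;
}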

@fujitatomoya
Contributor

@MichaelOrlov thanks for sharing that.

I was originally thinking that reading or modifying existing elements should be fine without a mutex, but emplace is the concern here. I would still like confirmation that the mutex actually fixes the problem. @R-Stokke thanks a lot for your effort!

@fujitatomoya
Contributor

@R-Stokke Just FYI, @MichaelOrlov and I came to the conclusion to change the locking mechanism a bit. If you have not done the verification yet, please try the latest change in #1561. Thanks in advance.
