Early (incorrect) exit in certain situations with mcap logging #1542

Closed
anrp-tri opened this issue Jan 17, 2024 · 19 comments
Labels
bug Something isn't working

Comments

@anrp-tri

anrp-tri commented Jan 17, 2024

Description

We run the ros2 bagger in CI to collect logs of a (CI) simulation run. In this setup we segregate parallel simulations with different ROS 2 domain IDs; each individual simulation has on the order of 50 topics. Because the simulations run in parallel, the machine is significantly loaded, i.e. scheduling delays are expected. About 2% of the time, the logger exits during system startup complaining about topic information.

Expected Behavior

The logger does not exit or crash.

Actual Behavior

Occasionally, during startup while topics are still appearing, we get one of three exits (the actual topic in question is not consistent). Some logs:

2024-01-15 13:49:52,859 ERROR:[INFO] [1705326592.534510178] [rosbag2_recorder]: Subscribed to topic '/carla/objects'
2024-01-15 13:49:52,859 ERROR:terminate called after throwing an instance of 'std::runtime_error'
2024-01-15 13:49:52,859 ERROR:what():  Channel reference not found for topic: "/ic/speedometer"
2024-01-12 13:29:40,012 ERROR:[INFO] [1705066179.755091747] [rosbag2_recorder]: Subscribed to topic '/ic/warnings/fuel'
2024-01-12 13:29:40,012 ERROR:terminate called after throwing an instance of 'std::runtime_error'
2024-01-12 13:29:40,012 ERROR:what():  Unknown message topic "/rosout"

These are both uncaught std::runtime_error exceptions, so the logger exits.
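For context, both messages come from topic lookups on the write path of the mcap storage plugin; the following is only a rough, illustrative sketch of that pattern, not the exact plugin code (the map names topics_ and channel_ids_ are the ones discussed later in this thread):

#include <cstdint>
#include <stdexcept>
#include <string>
#include <unordered_map>

// Illustrative sketch only, not the actual rosbag2_storage_mcap implementation.
struct WritePathSketch
{
  // Internal maps of the storage plugin (names as used in this thread).
  std::unordered_map<std::string, std::string> topics_;         // topic metadata in the real code
  std::unordered_map<std::string, std::uint16_t> channel_ids_;  // mcap::ChannelId in the real code

  void write(const std::string & topic)
  {
    // If the topic was never registered (or the maps are being mutated by
    // another thread), one of these lookups fails and the exception
    // propagates out of the recording thread uncaught.
    if (topics_.find(topic) == topics_.end()) {
      throw std::runtime_error("Unknown message topic \"" + topic + "\"");
    }
    if (channel_ids_.find(topic) == channel_ids_.end()) {
      throw std::runtime_error("Channel reference not found for topic: \"" + topic + "\"");
    }
    // ... otherwise the message is handed to the MCAP writer for this channel.
  }
};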

Very occasionally, we get an error (presumably from the STL) which, combined with the stack traces, smells like memory corruption:

2024-01-11 13:51:23,861 ERROR:[INFO] [1704981083.671525050] [rosbag2_recorder]: Subscribed to topic '/telemetry/roadwheel_angle'
2024-01-11 13:51:23,861 ERROR:terminate called after throwing an instance of 'std::bad_alloc'
2024-01-11 13:51:23,861 ERROR:what():  std::bad_alloc

To Reproduce

I was unable to reproduce this with a test program that spammed new topics. We see this quite reliably in our CI (about 1/50 runs of a simulation). We can instrument the built version of rosbag2_storage_mcap if useful.
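For reference, the kind of test program meant above looks roughly like this; an illustrative sketch with made-up node and topic names, run against ros2 bag record -a:

// Illustrative sketch of a "topic spammer" used to try to reproduce the crash.
// Node and topic names are made up.
#include <chrono>
#include <memory>
#include <string>
#include <vector>

#include <rclcpp/rclcpp.hpp>
#include <std_msgs/msg/string.hpp>

int main(int argc, char ** argv)
{
  rclcpp::init(argc, argv);
  auto node = std::make_shared<rclcpp::Node>("topic_spammer");
  std::vector<rclcpp::Publisher<std_msgs::msg::String>::SharedPtr> pubs;
  for (int i = 0; i < 500; ++i) {
    // Each new publisher creates a new topic the recorder has to discover and subscribe to.
    pubs.push_back(
      node->create_publisher<std_msgs::msg::String>("/spam/topic_" + std::to_string(i), 10));
    rclcpp::sleep_for(std::chrono::milliseconds(10));
  }
  rclcpp::spin(node);
  rclcpp::shutdown();
  return 0;
}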

System (please complete the following information)

  • OS: Ubuntu Jammy (also focal)
  • ROS 2 Distro: humble (also foxy)
  • Install Method: We build rosbag2_storage_mcap from source
  • Version: 0b4deb3

Additional context

I turned on core dumps and stack traces but haven't obtained a core file from this yet. However, I have gotten some stack traces that seem to point at concurrent access to MCAPStorage instance variables (specifically, it looks like a call from Python is creating a topic at the same time the writer is iterating over the topics_ variable). See the attached file.
traceback.txt

anrp-tri added the bug label Jan 17, 2024
@fujitatomoya
Contributor

fujitatomoya commented Jan 18, 2024

From a quick code scan, this looks like a data race between the storage writer thread and the discovery thread.

Thread 1 (Thread 0x7f86d34ea640 (LWP 118202)):
#0  __pthread_kill_implementation (no_tid=0, signo=6, threadid=140217047492160) at ./nptl/pthread_kill.c:44
#1  __pthread_kill_internal (signo=6, threadid=140217047492160) at ./nptl/pthread_kill.c:78
#2  __GI___pthread_kill (threadid=140217047492160, signo=signo@entry=6) at ./nptl/pthread_kill.c:89
#3  0x00007f86e0736476 in __GI_raise (sig=sig@entry=6) at ../sysdeps/posix/raise.c:26
#4  0x00007f86e071c7f3 in __GI_abort () at ./stdlib/abort.c:79
#5  0x00007f86df176bbe in ?? () from /lib/x86_64-linux-gnu/libstdc++.so.6
#6  0x00007f86df18224c in ?? () from /lib/x86_64-linux-gnu/libstdc++.so.6
#7  0x00007f86df1822b7 in std::terminate() () from /lib/x86_64-linux-gnu/libstdc++.so.6
#8  0x00007f86df182518 in __cxa_throw () from /lib/x86_64-linux-gnu/libstdc++.so.6
#9  0x00007f86d3755b23 in rosbag2_storage_plugins::MCAPStorage::write(std::shared_ptr<rosbag2_storage::SerializedBagMessage const>) [clone .cold] () from /home/ubuntu/jenkins-root/workspace/motion-sim-dev/install/lib/librosbag2_storage_mcap.so
#10 0x00007f86d3758e6c in rosbag2_storage_plugins::MCAPStorage::write(std::vector<std::shared_ptr<rosbag2_storage::SerializedBagMessage const>, std::allocator<std::shared_ptr<rosbag2_storage::SerializedBagMessage const> > > const&) () from /home/ubuntu/jenkins-root/workspace/motion-sim-dev/install/lib/librosbag2_storage_mcap.so
#11 0x00007f86deea7002 in rosbag2_cpp::writers::SequentialWriter::write_messages(std::vector<std::shared_ptr<rosbag2_storage::SerializedBagMessage const>, std::allocator<std::shared_ptr<rosbag2_storage::SerializedBagMessage const> > > const&) () from /opt/ros/humble/lib/librosbag2_cpp.so
#12 0x00007f86dee7a7f0 in rosbag2_cpp::cache::CacheConsumer::exec_consuming() () from /opt/ros/humble/lib/librosbag2_cpp.so
#13 0x00007f86df1b02b3 in ?? () from /lib/x86_64-linux-gnu/libstdc++.so.6
#14 0x00007f86e0788ac3 in start_thread (arg=<optimized out>) at ./nptl/pthread_create.c:442
#15 0x00007f86e081a850 in clone3 () at ../sysdeps/unix/sysv/linux/x86_64/clone3.S:81

I see that you currently use ros2 bag record -a -b 500000000 -d 300 -x; can you try adding --no-discovery to see if the same coredump can still be generated?

@MichaelOrlov
Contributor

@anrp-tri Can you please clarify whether you are using the ros2 bag record CLI command or building your own recorder on top of rosbag2_storage_mcap?
Because from the issue description:

Install Method: We build rosbag2_storage_mcap from source

this is unclear.

@anrp-tri
Author

anrp-tri commented Jan 18, 2024

@anrp-tri Can you please clarify whether you are using the ros2 bag record CLI command or building your own recorder on top of rosbag2_storage_mcap? Because from the issue description this is unclear.

We run ros2 bag record ..., not using a different frontend to it.

@anrp-tri
Author

I see that you currently use ros2 bag record -a -b 500000000 -d 300 -x; can you try adding --no-discovery to see if the same coredump can still be generated?

I can attempt this tomorrow (and would expect it to mask any issues), but generally that won't work for our use case, as we use -a plus a topic blacklist for recording to make sure we capture all possible data.

@MichaelOrlov
Contributor

@fujitatomoya Regarding your guess:

From a quick code scan, this looks like a data race between the storage writer thread and the discovery thread.

That seems not to be the case. We have made sure that this cannot happen, since we create the subscription only after discovering the topic:

void Recorder::subscribe_topic(const rosbag2_storage::TopicMetadata & topic)
{
  // Need to create topic in writer before we are trying to create subscription. Since in
  // callback for subscription we are calling writer_->write(bag_message); and it could happened
  // that callback called before we reached out the line: writer_->create_topic(topic)
  writer_->create_topic(topic);

  Rosbag2QoS subscription_qos{subscription_qos_for_topic(topic.name)};
  auto subscription = create_subscription(topic.name, topic.type, subscription_qos);
  if (subscription) {
    subscriptions_.insert({topic.name, subscription});
    RCLCPP_INFO_STREAM(
      this->get_logger(),
      "Subscribed to topic '" << topic.name << "'");
  } else {
    writer_->remove_topic(topic);
    subscriptions_.erase(topic.name);
  }
}

What could be happening is that we discover the same topic twice and remove the topic from the writer in the else branch:

 } else {
    writer_->remove_topic(topic);
    subscriptions_.erase(topic.name);
  }

@anrp-tri I would recommend trying to comment out those two lines that remove the topic from the writer, adding some logging to see whether this is the case, and trying to reproduce the issue.
Also, a coredump would be very helpful for further analysis.
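For illustration, that instrumentation could look roughly like this inside the else branch (a sketch, not a tested patch; the log wording is made up):

  } else {
    // Sketch: log instead of removing the topic, to see whether this branch
    // is ever reached for a topic that is already known to the writer.
    RCLCPP_WARN_STREAM(
      this->get_logger(),
      "create_subscription failed for topic '" << topic.name << "', keeping it in the writer");
    // writer_->remove_topic(topic);
    // subscriptions_.erase(topic.name);
  }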

@anrp-tri
Author

Unrelated to the above comments, a nightly run showed a proper segfault in the write function calling into std::_Hashtable<..., which I'm guessing is std::unordered_map<std::string, mcap::ChannelId> channel_ids_.
stacks2.txt

@MichaelOrlov
Contributor

@anrp-tri

Unrelated to the above comments, a nightly run showed a proper segfault in the write function calling into std::_Hashtable<..., which I'm guessing is std::unordered_map<std::string, mcap::ChannelId> channel_ids_.
stacks2.txt

The only possible situation for that crash is if someone calls writer_->remove_topic(topic); in parallel with writer_->write(messages);, either from the callback or from the cache consumer, which runs in a separate thread.
As I've mentioned above, try disabling the else branch containing:

    writer_->remove_topic(topic);
    subscriptions_.erase(topic.name);

@MichaelOrlov
Contributor

@anrp-tri Or, even better, also try wrapping the code responsible for creating a new subscription in a check for whether such a subscription already exists, e.g.:

void Recorder::subscribe_topic(const rosbag2_storage::TopicMetadata & topic)
{
  if (subscriptions_.find(topic.name) == subscriptions_.end()) {
    // Need to create topic in writer before we are trying to create subscription. Since in
    // callback for subscription we are calling writer_->write(bag_message); and it could happened
    // that callback called before we reached out the line: writer_->create_topic(topic)
    writer_->create_topic(topic);

    Rosbag2QoS subscription_qos{subscription_qos_for_topic(topic.name)};
    auto subscription = create_subscription(topic.name, topic.type, subscription_qos);
    if (subscription) {
      subscriptions_.insert({topic.name, subscription});
      RCLCPP_INFO_STREAM(
        this->get_logger(),
        "Subscribed to topic '" << topic.name << "'");
    } else {
      writer_->remove_topic(topic);
      subscriptions_.erase(topic.name);
    }
  }
}

@anrp-tri
Author

Sorry it's taking time to do this; I've enabled core dumps & system library collection. To be clear, we don't build rosbag2 (for humble), we use the system-provided one; we just build rosbag2_storage_mcap.

@MichaelOrlov
Contributor

@anrp-tri

Sorry it's taking time to do this; I've enabled core dumps & system library collection. To be clear, we don't build rosbag2 (for humble), we use the system-provided one; we just build rosbag2_storage_mcap.

That might be a problem and cause undefined behavior due to API/ABI incompatibility between different versions.
We guarantee API/ABI compatibility only within the scope of a single ROS 2 distro release.

@R-Stokke

We've experienced the same issue in our CI pipelines, specifically the 'Channel reference not found' error:

2024-01-15 13:49:52,859 ERROR:[INFO] [1705326592.534510178] [rosbag2_recorder]: Subscribed to topic '/carla/objects'
2024-01-15 13:49:52,859 ERROR:terminate called after throwing an instance of 'std::runtime_error'
2024-01-15 13:49:52,859 ERROR:what(): Channel reference not found for topic: "/ic/speedometer"

We also run Humble on Ubuntu 22.04, but we build all of the packages in rosbag2 (commit 21a09c5) and use the standard ros2 bag record ... CLI interface. I've tried modifying our branch with the if-statement suggested by @MichaelOrlov, plus an added log message, to hopefully catch the bug in action. We have a similar reproduction rate of around 1/50, so I'll report back once I know more.

@fujitatomoya
Contributor

fujitatomoya commented Feb 1, 2024

Sorry for being late to get back to this.

2024-01-15 13:49:52,859 ERROR:[INFO] [1705326592.534510178] [rosbag2_recorder]: Subscribed to topic '/carla/objects'
2024-01-15 13:49:52,859 ERROR:terminate called after throwing an instance of 'std::runtime_error'
2024-01-15 13:49:52,859 ERROR:what(): Channel reference not found for topic: "/ic/speedometer"

This happens at the same time as subscribing to a different topic; I still think this is a race condition problem.

That seems not to be the case. We have made sure that this cannot happen, since we create the subscription only after discovering the topic.

I did not mean a race between creating subscriptions and discovery; that part seems to be okay.

What I meant is concurrent read and write access to MCAPStorage::channel_ids_: the discovery thread writes to it via create_topic, while the CacheConsumer executor's thread reads it in MCAPStorage::write.

According to https://en.cppreference.com/w/cpp/container#Thread_safety, MCAPStorage::channel_ids_ needs to be protected by a mutex.

A proposal is available here: #1561.
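For readers following along, the general shape of that kind of fix is to serialize access to the shared map; a minimal, self-contained sketch of the idea (illustration only, not the actual #1561 diff):

// Minimal sketch of the locking idea: create_topic() runs on the discovery
// thread and write() on the cache-consumer thread, so both accesses to the
// shared map are serialized with one mutex. Illustration only.
#include <cstdint>
#include <mutex>
#include <stdexcept>
#include <string>
#include <unordered_map>

class StorageSketch
{
public:
  void create_topic(const std::string & topic, std::uint16_t channel_id)
  {
    std::lock_guard<std::mutex> lock(mutex_);
    channel_ids_.emplace(topic, channel_id);  // may trigger a rehash; now safe
  }

  std::uint16_t channel_for(const std::string & topic)
  {
    std::lock_guard<std::mutex> lock(mutex_);
    auto it = channel_ids_.find(topic);
    if (it == channel_ids_.end()) {
      throw std::runtime_error("Channel reference not found for topic: \"" + topic + "\"");
    }
    return it->second;
  }

private:
  std::mutex mutex_;
  std::unordered_map<std::string, std::uint16_t> channel_ids_;  // mcap::ChannelId in the real code
};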

@fujitatomoya
Contributor

@anrp-tri @R-Stokke If you could try #1561 to see whether it solves this problem, that would be really appreciated.

CC: @MichaelOrlov

@R-Stokke

R-Stokke commented Feb 1, 2024

@fujitatomoya, when cherry-picking your commit I realized that the humble branch is missing a sanity check that was added to rolling a while back, where the remove_topic function does nothing if a subscription to that topic does not exist, very similar to the fix proposed by @MichaelOrlov. Not sure if this is the culprit, but I thought I'd flag it.

I've run around 700 tests so far and have not been able to reproduce it; I will continue with the mutex protections. For reference, the check I mentioned: https://github.com/ros2/rosbag2/blame/b8a9b06c983898da17898bd9248e38bd2cd6a893/rosbag2_storage_mcap/src/mcap_storage.cpp#L843.

@MichaelOrlov
Contributor

@fujitatomoya It is unclear to me why we should protect the unordered_map with a mutex.
According to the "Iterator invalidation" section of https://en.cppreference.com/w/cpp/container, if new elements are inserted into an unordered_map, all iterators remain valid.
Could you please add more details if you are aware of any?

@MichaelOrlov
Contributor

OK, it seems I found out why.
It is all about unordered_map rehashing.
According to https://en.cppreference.com/w/cpp/container/unordered_map/emplace:

If after the operation the new number of elements is greater than old max_load_factor() * bucket_count() a rehashing takes place.
If rehashing occurs (due to the insertion), all iterators are invalidated. Otherwise (no rehashing), iterators are not invalidated.
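A small self-contained demo makes that concrete (illustration only, not recorder code); a concurrent, unsynchronized find() during an insert that rehashes is a data race:

// Demo: inserting into an unordered_map eventually triggers a rehash
// (the bucket count grows), which invalidates all iterators into the map.
#include <cstddef>
#include <iostream>
#include <string>
#include <unordered_map>

int main()
{
  std::unordered_map<std::string, int> m;
  std::size_t buckets = m.bucket_count();
  for (int i = 0; i < 100; ++i) {
    m.emplace("/topic_" + std::to_string(i), i);
    if (m.bucket_count() != buckets) {
      std::cout << "rehash after " << m.size() << " elements: "
                << buckets << " -> " << m.bucket_count() << " buckets\n";
      buckets = m.bucket_count();
    }
  }
  return 0;
}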

@fujitatomoya
Contributor

@MichaelOrlov thanks for sharing that.

I was originally thinking that reading or modifying existing elements should be fine without a mutex, but emplace is the concern here. I would still like confirmation that the mutex actually fixes the problem. @R-Stokke thanks a lot for your effort!

@fujitatomoya
Contributor

@R-Stokke Just FYI, @MichaelOrlov and I came to the conclusion to change the locking mechanism a bit. If you have not done the verification yet, please try the latest change in #1561. Thanks in advance.
