Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Mutex protection for db writing and stl collections in writer & storage #603

Merged
merged 7 commits into from
Jan 22, 2021

Conversation

adamdbrw
Copy link
Collaborator

@adamdbrw adamdbrw commented Jan 19, 2021

Adresses the #602 issue.

The issue is with that the current locking mechanism on top level (mutexes in writer.cpp in rosbag2_cpp, added with #491) is valid only with assumption that all writes of data and all creation/removal of topics are synchronous. This is not the case since double buffering has been introduced in the writer. This also would not be the case if another, non-sequential writer implementation was added, and would be unnecessary for another, parallel-enabled storage implementation.

This PR however does not remove mutexing on the rosbag2_cpp level, only adds storage implementation level locks - lets discuss/test whether remaining (intermittent) layers are multithreading-ready.

@adamdbrw adamdbrw requested a review from a team as a code owner January 19, 2021 13:32
@adamdbrw adamdbrw requested review from Karsten1987 and jikawa-az and removed request for a team January 19, 2021 13:32
@adamdbrw adamdbrw changed the title Mutex-protected writes and topic creation/removal in sqlite_storage Mutex-protected db writing in sqlite_storage Jan 19, 2021
@adamdbrw
Copy link
Collaborator Author

The failing test (due to a timeout) for reading looks unrelated and the same test seems to be failing on master as well

@mjeronimo
Copy link
Contributor

My understanding so far:

  • The mutex usage at the rosbag2 writer level are only valid in the case that the back-end implementation is synchronous (it basically assumes a synchronous implementation).
  • Currently, the sqlite3 back-end opens sqlite3 databases using SQLITE_OPEN_NOMUTEX, which means that separate threads are allowed to use SQLite at the same time, as long as each thread is using a different database connection. However, since we are seeing this crash, I'd assume that it is not true that each thread is using a different database connection.
  • The proposed solution keeps the above while introducing another mutex at the Writer level of the sqlite3 back-end, used for any database write operations, effectively compensating for multiple threads using the same database connection.

IMO,

  • The mutex at the rosbag2 writer level should be removed and the burden of thread safety should be on the back-end implementation.
  • Regarding the sqlite3 back-end, if performance does not seem to be an issue, why not push the responsibility further by opening the databases with SQLITE_OPEN_FULLMUTEX and not having a mutex at the Writer level? Then, the responsibility is only in sqlite3 (versus what would be in three places). Vectors of messages seemed to be a potential concern, but it isn't clear if there is a performance issue in practice (because of the I/O dominating).

@adamdbrw
Copy link
Collaborator Author

adamdbrw commented Jan 20, 2021

@pijaro ran a wider array of benchmarks with this fix:
Update:
With current version (FULLMUTEX only) there is another race condition that the v1 of this PR was preventing:

[ros2-1] [INFO] [1611149968.805248872] [rosbag2_transport]: Subscribed to topic '/rosbag2_performance_benchmarking_node/benchmarking_small_178'
[ros2-1] terminate called after throwing an instance of 'rosbag2_storage_plugins::SqliteException'
[ros2-1]   what():  Topic '/rosbag2_performance_benchmarking_node/benchmarking_small_292' has not been created yet! Call 'create_topic' first.

At first glance, this shouldn't happen since topic subscription first calls a synchronous create_topic() function that goes all the way to the storage and there is an execute and reset called in the end, which means it should be commited before the subscribe is on.
However, subscribe_topic can be called from multiple threads.

void Recorder::subscribe_topic(const rosbag2_storage::TopicMetadata & topic)
{
  // Need to create topic in writer before we are trying to create subscription. Since in
  // callback for subscription we are calling writer_->write(bag_message); and it could happened
  // that callback called before we reached out the line: writer_->create_topic(topic)
  writer_->create_topic(topic);
  Rosbag2QoS subscription_qos{subscription_qos_for_topic(topic.name)};
  auto subscription = create_subscription(topic.name, topic.type, subscription_qos);

The race condition can happen with the discovery mechanism (since it is in a separate thread). This means that subscribe_topic can be called from more than one thread. Another component of that race condition is that whether topic is already present is checked not only on the db level, but also in the sequential writer structure. Thus race condition: it requires some imagination to understand but goes as follow:

  1. first thread: calls writer_->create_topic and subsequently SequentialWriter::create_topic, executing code including a call to topics_names_to_info_.insert, but not starting or finishing through the storage_->create_topic call. Now we have topic_names_to_info having the topic inside, but the database has no topic.
  2. second thread: calls writer_->create_topic on the same topic (since it checks subscriptions and the first thread didn't yet subscribe). It does nothing since the topic exists in the topic_names_to_info structure. Then it proceeds to subscribe.

Thus we have a situation where we are subscribed to a topic which isn't in the db. Clearly invalid and causes the exception above.

The fix would be either:

  1. determining whether the topic is already there based on db query each time, with no intermediate structures.
  2. restoring a writer.cpp mutex in create_topic, remove_topic and write (it was there in v1 of this PR). All these calls use intermediate structure in writer (topic_names_to_info), which causes all the trouble.
  3. introducing discovery-level mutex -> which would lock subscribe_topic in Recorder.

A helpful question to ask: do we think it is valid that create_topic can be called twice (or more) with the same topic name. We either defend against that (writer.cpp level locking) or prevent that (recorder.cpp level locking).

I propose to fix it with 2) since it is the conservative approach (less changes, using previous sync mechanism next to our new, neccessary, db-level one). This is now in v3 version of this PR.

Either way, I am for merging a fix that works rather quickly while we can have a discussion on a synchronisation solution that makes the most sense.

@adamdbrw
Copy link
Collaborator Author

@csli8192 this fix does not include discovery mechanism changes which you earlier suggested. If they are still necessary for your use case, I believe is best to merge them in separately.

@adamdbrw adamdbrw marked this pull request as draft January 20, 2021 14:15
@adamdbrw adamdbrw marked this pull request as ready for review January 20, 2021 14:57
@mjeronimo
Copy link
Contributor

@adamdbrw Thanks for the detailed explanation. I now understand that the intermediate data structures also have to be protected and that it's not enough to just use FULLMUTEX.

@emersonknapp
Copy link
Collaborator

emersonknapp commented Jan 21, 2021

The failing test (due to a timeout) for reading looks unrelated and the same test seems to be failing on master as well

Yes - it appears this test started failing on Cyclone (but not FastDDS), which is now the default in rolling - I'm looking into getting this test passing again.

@adamdbrw
Copy link
Collaborator Author

adamdbrw commented Jan 21, 2021

We have given it a deeper thought and ran extensive tests. We found another concurrency problem, and it looks like the version 1 of this changeset was the correct one. Why?

All previous reasoning holds, but there is an additional issue to look at.

SqliteStorage has unordered_map structure holding the topics. It is not a thread safe structure. As it is now, create_topic can be run from one thread and initiate .emplace. Meanwhile, the buffer writer thread will attempt to find a topic for the message it is trying to insert, running .find on the same structure. As you likely know, this threading scenario is not supported in stl containers.

In particular, for unordered_map, a likely case is rehashing. "If rehashing occurs due to the insertion, all iterators are invalidated. Otherwise iterators are not affected".

Thus a rare case which we observed in extensive testing with over a thousand of topics (thanks @pijaro).

Anyway, bottom line is that any stl collections that can be accessed outside of supported model (only 1 thread writes while 0 read, or any number of threads read) should be locked. We don't need to dig deep into details such as rehashing during find - we just need to lock them. And with locking on a member function level, we don't need FULLMUTEX anymore - it is completely redundant.

Our tests are passing now. How to test yourself (simplest version):

  • Update from repo
  • git checkout sqlite_storage_locking
  • git checkout -b local_tests
  • git merge performance_benchmarking_refactor
  • git merge record_with_topic_regex
  • build it, source
  • one terminal, run ros2 bag record -o bag_name -e '/*_benchmarking_node/'
  • second terminal, ros2 run rosbag2_performance_benchmarking benchmark_publishers --ros-args -r __node:=rosbag2_performance_benchmarking_node --params-file `ros2 pkg prefix rosbag2_performance_benchmarking`/share/rosbag2_performance_benchmarking/config/producers/mixed_110Mbs.yaml. Edit number of producers in the yaml as you wish, but use FastDDS if you want more than 119 (we use separate threads for each in benchmark atm and CycloneDDS has a hard limit of 128, of which 10 or so are used internally).
  • you should see a lot of 'Subscribed to topic' and no crash in the rosbag2 window (without this PR you would definitely see one. With a previous version of this PR you would occasionally see one with lots of publishers).

@adamdbrw adamdbrw changed the title Mutex-protected db writing in sqlite_storage Mutex-protection for db writing and stl collections in writer & storage Jan 21, 2021
@adamdbrw adamdbrw changed the title Mutex-protection for db writing and stl collections in writer & storage Mutex protection for db writing and stl collections in writer & storage Jan 21, 2021
Copy link
Collaborator

@emersonknapp emersonknapp left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This looks good - I'm going to (optionally) ask for Clang thread safety annotations, as they will help validate and clarify the logic a little bit.

e.g. https://github.com/ros2/rcpputils/blob/master/include/rcpputils/thread_safety_annotations.hpp#L133

I'm not sure if you need this across the class - but if so, you could guard the two class members that you seem to be synchronizing access to

std::unordered_map<std::string, int> topics_ GUARDED_BY(database_write_mutex_);
std::shared_ptr<SqliteWrapper> database_ GUARDED_BY(database_write_mutex_);

Signed-off-by: Adam Dabrowski <adam.dabrowski@robotec.ai>
Signed-off-by: Adam Dabrowski <adam.dabrowski@robotec.ai>
…is necessary

Signed-off-by: Adam Dabrowski <adam.dabrowski@robotec.ai>
… why it is necessary"

This reverts commit a022ab7.

Signed-off-by: Adam Dabrowski <adam.dabrowski@robotec.ai>
This reverts commit 9d6e2c2.

Signed-off-by: Adam Dabrowski <adam.dabrowski@robotec.ai>
…ore clear what is protected

Signed-off-by: Adam Dabrowski <adam.dabrowski@robotec.ai>
Signed-off-by: Adam Dabrowski <adam.dabrowski@robotec.ai>
@emersonknapp
Copy link
Collaborator

Gist: https://gist.githubusercontent.com/emersonknapp/ca5fc9b7640834b2ec160af39e38a1c2/raw/c00c5de4bbf391319d623fde04f97a2a0200aa1b/ros2.repos
BUILD args: --packages-up-to rosbag2_storage_default_plugins rosbag2_tests
TEST args: --packages-select rosbag2_storage_default_plugins rosbag2_tests
Job: ci_launcher
ci_launcher ran: https://ci.ros2.org/job/ci_launcher/7502

  • Linux Build Status
  • Linux-aarch64 Build Status
  • macOS Build Status
  • Windows Build Status

@adamdbrw
Copy link
Collaborator Author

All looks green, shall we merge this? :]

@emersonknapp emersonknapp merged commit 277e9cc into master Jan 22, 2021
@delete-merged-branch delete-merged-branch bot deleted the sqlite_storage_locking branch January 22, 2021 22:54
emersonknapp pushed a commit that referenced this pull request Feb 2, 2021
…ge (#603)

* Mutex-protected writes and topic creation/removal in sqlite_storage

Signed-off-by: Adam Dabrowski <adam.dabrowski@robotec.ai>
emersonknapp pushed a commit that referenced this pull request Feb 17, 2021
…ge (#603)

* Mutex-protected writes and topic creation/removal in sqlite_storage

Signed-off-by: Adam Dabrowski <adam.dabrowski@robotec.ai>
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

None yet

3 participants