-
Notifications
You must be signed in to change notification settings - Fork 248
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Mutex protection for db writing and stl collections in writer & storage #603
Conversation
The failing test (due to a timeout) for reading looks unrelated and the same test seems to be failing on master as well |
My understanding so far:
IMO,
|
@pijaro ran a wider array of benchmarks with this fix:
At first glance, this shouldn't happen since topic subscription first calls a synchronous create_topic() function that goes all the way to the storage and there is an execute and reset called in the end, which means it should be commited before the subscribe is on.
The race condition can happen with the discovery mechanism (since it is in a separate thread). This means that subscribe_topic can be called from more than one thread. Another component of that race condition is that whether topic is already present is checked not only on the db level, but also in the sequential writer structure. Thus race condition: it requires some imagination to understand but goes as follow:
Thus we have a situation where we are subscribed to a topic which isn't in the db. Clearly invalid and causes the exception above. The fix would be either:
A helpful question to ask: do we think it is valid that create_topic can be called twice (or more) with the same topic name. We either defend against that (writer.cpp level locking) or prevent that (recorder.cpp level locking). I propose to fix it with 2) since it is the conservative approach (less changes, using previous sync mechanism next to our new, neccessary, db-level one). This is now in v3 version of this PR. Either way, I am for merging a fix that works rather quickly while we can have a discussion on a synchronisation solution that makes the most sense. |
@csli8192 this fix does not include discovery mechanism changes which you earlier suggested. If they are still necessary for your use case, I believe is best to merge them in separately. |
@adamdbrw Thanks for the detailed explanation. I now understand that the intermediate data structures also have to be protected and that it's not enough to just use FULLMUTEX. |
...g2_storage_default_plugins/include/rosbag2_storage_default_plugins/sqlite/sqlite_storage.hpp
Outdated
Show resolved
Hide resolved
rosbag2_storage_default_plugins/src/rosbag2_storage_default_plugins/sqlite/sqlite_wrapper.cpp
Outdated
Show resolved
Hide resolved
Yes - it appears this test started failing on Cyclone (but not FastDDS), which is now the default in rolling - I'm looking into getting this test passing again. |
We have given it a deeper thought and ran extensive tests. We found another concurrency problem, and it looks like the version 1 of this changeset was the correct one. Why? All previous reasoning holds, but there is an additional issue to look at. SqliteStorage has unordered_map structure holding the topics. It is not a thread safe structure. As it is now, create_topic can be run from one thread and initiate In particular, for unordered_map, a likely case is rehashing. "If rehashing occurs due to the insertion, all iterators are invalidated. Otherwise iterators are not affected". Thus a rare case which we observed in extensive testing with over a thousand of topics (thanks @pijaro). Anyway, bottom line is that any stl collections that can be accessed outside of supported model (only 1 thread writes while 0 read, or any number of threads read) should be locked. We don't need to dig deep into details such as rehashing during find - we just need to lock them. And with locking on a member function level, we don't need FULLMUTEX anymore - it is completely redundant. Our tests are passing now. How to test yourself (simplest version):
|
75fba53
to
6ff6371
Compare
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This looks good - I'm going to (optionally) ask for Clang thread safety annotations, as they will help validate and clarify the logic a little bit.
e.g. https://github.com/ros2/rcpputils/blob/master/include/rcpputils/thread_safety_annotations.hpp#L133
I'm not sure if you need this across the class - but if so, you could guard the two class members that you seem to be synchronizing access to
std::unordered_map<std::string, int> topics_ GUARDED_BY(database_write_mutex_);
std::shared_ptr<SqliteWrapper> database_ GUARDED_BY(database_write_mutex_);
...g2_storage_default_plugins/include/rosbag2_storage_default_plugins/sqlite/sqlite_storage.hpp
Outdated
Show resolved
Hide resolved
Signed-off-by: Adam Dabrowski <adam.dabrowski@robotec.ai>
Signed-off-by: Adam Dabrowski <adam.dabrowski@robotec.ai>
…is necessary Signed-off-by: Adam Dabrowski <adam.dabrowski@robotec.ai>
… why it is necessary" This reverts commit a022ab7. Signed-off-by: Adam Dabrowski <adam.dabrowski@robotec.ai>
This reverts commit 9d6e2c2. Signed-off-by: Adam Dabrowski <adam.dabrowski@robotec.ai>
…ore clear what is protected Signed-off-by: Adam Dabrowski <adam.dabrowski@robotec.ai>
Signed-off-by: Adam Dabrowski <adam.dabrowski@robotec.ai>
6ff6371
to
e7d7c74
Compare
Gist: https://gist.githubusercontent.com/emersonknapp/ca5fc9b7640834b2ec160af39e38a1c2/raw/c00c5de4bbf391319d623fde04f97a2a0200aa1b/ros2.repos |
All looks green, shall we merge this? :] |
…ge (#603) * Mutex-protected writes and topic creation/removal in sqlite_storage Signed-off-by: Adam Dabrowski <adam.dabrowski@robotec.ai>
…ge (#603) * Mutex-protected writes and topic creation/removal in sqlite_storage Signed-off-by: Adam Dabrowski <adam.dabrowski@robotec.ai>
Adresses the #602 issue.
The issue is with that the current locking mechanism on top level (mutexes in writer.cpp in rosbag2_cpp, added with #491) is valid only with assumption that all writes of data and all creation/removal of topics are synchronous. This is not the case since double buffering has been introduced in the writer. This also would not be the case if another, non-sequential writer implementation was added, and would be unnecessary for another, parallel-enabled storage implementation.
This PR however does not remove mutexing on the rosbag2_cpp level, only adds storage implementation level locks - lets discuss/test whether remaining (intermittent) layers are multithreading-ready.