LSMKV: Store Map in always sorted manner #1832
This is meant as a first step to then later be able to change the storage in a way that it is already sorted and we can remove the runtime sort step
This currently still breaks the implementation because of duplicates in the memtable, which need to be fixed before the tests can pass.
This required splitting it out into a separate map type. There are a few encoding/decoding workarounds in place right now to keep all the tests passing and everything working. Once the disk segments are sorted as well, these need to be removed together with the temporary runtime sorting. CI still needs to be skipped, because the implementation does not yet support cursors.
This should finally make all existing tests green
This also highlighted that the name mapDecoder no longer makes sense, as it acts more like a map merger now.
The redundant runtime merges are not yet removed. They will follow shortly.
It is now no longer required since the segments themselves are stored sorted.
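Because every segment is stored sorted, combining the maps of two segments can be done as a single linear merge instead of a runtime sort. The following is only a minimal sketch of that idea; `MapPair`, its `Tombstone` field, and `mergeSorted` are hypothetical names chosen for illustration and are not taken from the Weaviate code base.

```go
package main

import (
	"bytes"
	"fmt"
)

// MapPair is a hypothetical stand-in for one key/value entry of a row's map.
type MapPair struct {
	Key       []byte
	Value     []byte
	Tombstone bool
}

// mergeSorted merges two key-sorted slices in one linear pass.
// On equal keys the entry from the newer segment wins.
// Tombstoned entries are dropped from the result, which is only safe
// when no older segments remain to be merged afterwards.
func mergeSorted(older, newer []MapPair) []MapPair {
	out := make([]MapPair, 0, len(older)+len(newer))
	i, j := 0, 0
	for i < len(older) || j < len(newer) {
		var next MapPair
		switch {
		case i == len(older):
			next = newer[j]
			j++
		case j == len(newer):
			next = older[i]
			i++
		default:
			c := bytes.Compare(older[i].Key, newer[j].Key)
			switch {
			case c < 0:
				next = older[i]
				i++
			case c > 0:
				next = newer[j]
				j++
			default:
				// Same key in both segments: keep the newer entry.
				next = newer[j]
				i++
				j++
			}
		}
		if !next.Tombstone {
			out = append(out, next)
		}
	}
	return out
}

func main() {
	older := []MapPair{
		{Key: []byte("a"), Value: []byte("1")},
		{Key: []byte("c"), Value: []byte("3")},
	}
	newer := []MapPair{
		{Key: []byte("a"), Tombstone: true},
		{Key: []byte("b"), Value: []byte("2")},
	}
	for _, p := range mergeSorted(older, newer) {
		fmt.Printf("%s=%s\n", p.Key, p.Value)
	}
}
```

In this sketch the tombstone for `a` in the newer segment removes the older value, so only `b` and `c` remain, already in key order.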
This still requires version checking as outlined in #1833
The memtable for Map is a binary tree, so it is always sorted. However, since this is type 'Map', each "row key" holds a map. This map was unsorted in the past. In #1832 we introduced a change that made sure this map would always be sorted ON DISK, i.e. in the segments. It was very natural to also keep it sorted in the memtable, as we did not have to do any sorting when flushing.
However, the performance tests on imports that make heavy use of the inverted index showed a large performance degradation after #1832. In a test I did locally, the import time went up by over 30%.
This fix goes back to keeping the KV pairs unsorted and making each change an append-only operation. This means it now needs to be sorted in just two places (as opposed to on every single insertion):
1. On a read query. Those should be rare on the memtable, since memtables are mostly meant for writing. The added overhead here (minimal) is not a problem, since it was also there before #1832.
2. When flushing. Flushing is an async operation and the small overhead of sorting each row's Map KVs is negligible.
This new implementation has the same import speed as prior to #1832 while keeping all the runtime benefits of having the KV pairs sorted on disk.
closes #1852
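The append-only approach described above can be sketched roughly as follows. This is an illustration under assumptions, not the actual memtable code: the `row` type, `Put`, and `Sorted` are hypothetical names, and the real implementation may handle duplicates and tombstones differently.

```go
package main

import (
	"bytes"
	"fmt"
	"sort"
)

// MapPair is a hypothetical stand-in for one key/value entry of a row's map.
type MapPair struct {
	Key   []byte
	Value []byte
}

// row is a hypothetical memtable entry holding the map for one row key.
type row struct {
	pairs []MapPair
}

// Put is append-only: no sorting and no duplicate check on the write path.
func (r *row) Put(key, value []byte) {
	r.pairs = append(r.pairs, MapPair{Key: key, Value: value})
}

// Sorted returns the pairs ordered by key, keeping only the latest write
// per key. It is meant to be called on reads and on flush only.
func (r *row) Sorted() []MapPair {
	cp := make([]MapPair, len(r.pairs))
	copy(cp, r.pairs)
	// A stable sort preserves insertion order among equal keys,
	// so the last element of each run is the most recent write.
	sort.SliceStable(cp, func(a, b int) bool {
		return bytes.Compare(cp[a].Key, cp[b].Key) < 0
	})
	out := make([]MapPair, 0, len(cp))
	for i, p := range cp {
		if i+1 < len(cp) && bytes.Equal(p.Key, cp[i+1].Key) {
			continue // a newer write for the same key follows
		}
		out = append(out, p)
	}
	return out
}

func main() {
	r := &row{}
	r.Put([]byte("doc2"), []byte("v1"))
	r.Put([]byte("doc1"), []byte("v1"))
	r.Put([]byte("doc2"), []byte("v2"))
	for _, p := range r.Sorted() {
		fmt.Printf("%s=%s\n", p.Key, p.Value)
	}
}
```

The design trade-off is that each insertion stays a cheap append, while the sorting cost is paid once per read or flush rather than once per write.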
Thank you for your contribution to Weaviate. This issue has not received any activity in a while and has therefore been marked as stale. Stale issues will eventually be autoclosed. This does not mean that we are ruling out to work on this issue, but it most likely has not been prioritized high enough in the last months.
We are seeing during the implementation of #1798 that the current "random order map" implementation for `SegmentStrategyMap` (and same for `Set`) cannot handle the load during large inverted index lookups.
For example, using the wiki dataset (26M data points) with 1.7M doc ids matching, it takes over 600ms just to remove tombstones, which is not acceptable for such a query.
EDIT: originally the idea was to update this both for `Map` and `Set`. Due to time restrictions I've only implemented this for `Map` so far, which is what's used for text/string and therefore for BM25. A follow-up issue will be created to also implement this for `Set`, so that numerical filters, etc. can also benefit from this.
Goals
- `decoders` are changed to use a much simpler and faster decode, similar to the LSM cursors
- `BigEndian` byte order into the maps, this way the store sorting can also be used later on for application-level tasks (merging, etc.)
Breaking Changes