LSMKV: Object Count #1811
Comments
This primitive implementation does not handle updates and deletes yet and must be adapted in a future commit
currently only works for unique additions, does not yet handle updates or deletes
This does not yet check for disk segments, so this only works if all disk segments consist of unique additions, but does work if the memtable contains updates or deletes.
currently this process is skipped during compactions, so the values are not reliable for compacted segments. Has not yet been evaluated for performance.
also introduce new count api in meta count
Thank you for your contribution to Weaviate. This issue has not received any activity in a while and has therefore been marked as stale. Stale issues will eventually be autoclosed. This does not mean that we are ruling out to work on this issue, but it most likely has not been prioritized high enough in the last months.
Background
This is a requirement for #1798. Currently, the only way to count objects in an `lsmkv.Bucket` is to iterate over all of them. Such an iteration is currently implemented for `{ Aggregate { meta { count } } }`, which is orders of magnitude too slow to be part of a BM25 query. The iteration is needed because a simple count of keys does not contain enough information to determine the true count: a key could contain a tombstone, and we don't know whether a previous segment contained objects for those keys. Similarly, a key could be either a create or an update.
Proposed Algorithms
I could not find any information on whether there is a standard algorithm for an efficient count in LSM-based indices, so I came up with the following, which should be fairly efficient without too much overhead:
… (`replace`, `set`, `map`), a previously seen key means that the count is not changed.
When a new object comes in, determine which of the 4 states it is in.
When `.Count()` is called, this information can be aggregated. If such a runtime aggregation is still too slow, it could be cached in the memtable, with a new write occurring as the cache invalidation event.
Notes
Once present, this new algorithm can also be used to considerably speed up `{ Aggregate { meta { count } } }`.
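To make the counting idea under Proposed Algorithms more concrete, here is a minimal Go sketch. It is not the actual `lsmkv` implementation: `bucket`, `segment`, `writeKind`, and every method on them are hypothetical names invented for illustration. The four states are an assumption about what the proposal means (create, update, delete of an existing key, delete of a missing key). Each write is classified once, only creates and effective deletes change a per-segment net count, and `Count()` sums those net counts instead of iterating over keys.

```go
package main

// writeKind classifies an incoming write. These four states are an
// assumption made for this sketch, not taken from the lsmkv code.
type writeKind int

const (
	createNew      writeKind = iota // key not live before, value written: +1
	updateExisting                  // key live before, value written: 0
	deleteExisting                  // key live before, tombstone written: -1
	deleteMissing                   // key not live before, tombstone written: 0
)

// segment is a toy stand-in for a memtable or a flushed disk segment.
type segment struct {
	entries  map[string]bool // key -> true for a live value, false for a tombstone
	netCount int             // net change in object count caused by this segment
}

// bucket is a toy stand-in for an LSM bucket (not the real lsmkv.Bucket).
type bucket struct {
	memtable segment
	flushed  []segment // oldest first
}

func newBucket() *bucket {
	return &bucket{memtable: segment{entries: map[string]bool{}}}
}

// isLive reports whether key currently resolves to a live value. It checks
// the memtable first, then the flushed segments from newest to oldest.
func (b *bucket) isLive(key string) bool {
	if live, ok := b.memtable.entries[key]; ok {
		return live
	}
	for i := len(b.flushed) - 1; i >= 0; i-- {
		if live, ok := b.flushed[i].entries[key]; ok {
			return live
		}
	}
	return false
}

// classify determines which of the four states a write is in.
func (b *bucket) classify(key string, tombstone bool) writeKind {
	live := b.isLive(key)
	switch {
	case !tombstone && !live:
		return createNew
	case !tombstone && live:
		return updateExisting
	case tombstone && live:
		return deleteExisting
	default:
		return deleteMissing
	}
}

// apply classifies the write before recording it, then adjusts the
// memtable's net count only for creates and effective deletes.
func (b *bucket) apply(key string, tombstone bool) {
	switch b.classify(key, tombstone) {
	case createNew:
		b.memtable.netCount++
	case deleteExisting:
		b.memtable.netCount--
	}
	b.memtable.entries[key] = !tombstone
}

func (b *bucket) Put(key string)    { b.apply(key, false) }
func (b *bucket) Delete(key string) { b.apply(key, true) }

// Flush turns the memtable into an immutable segment and starts a new one.
func (b *bucket) Flush() {
	b.flushed = append(b.flushed, b.memtable)
	b.memtable = segment{entries: map[string]bool{}}
}

// Count aggregates the per-segment net counts without touching any keys.
func (b *bucket) Count() int {
	total := b.memtable.netCount
	for _, s := range b.flushed {
		total += s.netCount
	}
	return total
}
```

In this sketch the cost moves from read time to write time: every write performs an existence lookup across segments so that `Count()` only has to add up one integer per segment. Whether that trade-off is acceptable for real workloads would need to be measured.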