Text index inverted index compression #3563
Conversation
Force-pushed fa25bd0 to 1b1db47
Force-pushed cafcb7e to 5aae0b8
Force-pushed 5aae0b8 to 466cefc (log compressed size, compress sorted, fix unit tests, flatten chunks, use visitor pattern to avoid multiple decompressions, check postings list boundaries, debug increasing order check, simplify ranges check, unit tests for visitor, remove debug size measurements, fix build)
Do you have an end-to-end test to see if the search latency is impacted by the compression overhead?
I added a section with performance measurement results in the description.
@@ -165,7 +180,7 @@ impl InvertedIndex {
        };
    }
    // Smallest posting is the largest possible cardinality
-   let smallest_posting = postings.iter().map(|posting| posting.len()).min().unwrap();
+   let smallest_posting = postings.iter().cloned().min().unwrap();
Can we call `cloned()` only after `min()`?
Would using `copied()` be possible here? That would be a lot cheaper.
fixed
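The reviewers' suggestion can be illustrated with a small sketch (names are illustrative, not the actual index code): `iter()` yields references, so moving `copied()` after `min()` converts only the single winning element instead of every element.

```rust
// `iter()` yields `&usize`; `min()` compares references without copying.
// Calling `copied()` on the `Option<&usize>` result converts at most one
// value, instead of one per element as `map(|p| p.len())` or eager cloning would.
fn smallest(postings: &[usize]) -> Option<usize> {
    postings.iter().min().copied()
}

fn main() {
    assert_eq!(smallest(&[5, 2, 9]), Some(2));
    assert_eq!(smallest(&[]), None);
}
```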
        .into_iter()
        .map(|x| x.map(CompressedPostingList::new))
        .collect();
    postings.shrink_to_fit();
Why is `shrink_to_fit` necessary? I'd assume the `collect` method would size the vector appropriately.
Yes, assuming the iterator's size information is exact, it is unnecessary.
fixed
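A minimal sketch of the point being made: mapping over a `Vec` iterator preserves the exact size hint, so `collect` can allocate the right capacity up front (this is the usual behavior for exact-size iterators, not a hard documented guarantee).

```rust
fn main() {
    let src = vec![1u32, 2, 3, 4];
    // `map` over a Vec iterator keeps the exact size hint, so `collect`
    // allocates appropriately; a following `shrink_to_fit` is then a no-op.
    let doubled: Vec<u32> = src.into_iter().map(|x| x * 2).collect();
    assert_eq!(doubled, vec![2, 4, 6, 8]);
    // Capacity is at least len; in practice it matches for exact-size iterators.
    assert!(doubled.capacity() >= doubled.len());
}
```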
}

#[derive(Default)]
pub struct ImmutableInvertedIndex {
-   postings: Vec<Option<PostingList>>,
+   postings: Vec<Option<CompressedPostingList>>,
Are we saving the `ImmutableInvertedIndex` to disk with the compressed posting lists at some point?
No, we build it from RocksDB while loading.
data: Box<[u8]>,
chunks: Box<[CompressedPostingChunk]>,
Can't we use a vector in these two cases instead, rather than a boxed array?
I used a boxed array to signal that the size cannot change. But in general I don't have any objections. Fixed.
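The trade-off discussed above can be sketched: a boxed slice drops the capacity field and communicates a fixed size, while converting back to `Vec` remains cheap if growth is needed later.

```rust
fn main() {
    let v: Vec<u8> = vec![1, 2, 3];
    // `into_boxed_slice` shrinks to fit and discards the capacity field,
    // signaling that the collection will not grow.
    let b: Box<[u8]> = v.into_boxed_slice();
    assert_eq!(b.len(), 3);
    // Converting back is an O(1) move if growth becomes necessary.
    let v2: Vec<u8> = b.into_vec();
    assert_eq!(v2, vec![1, 2, 3]);
}
```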
}

fn find_chunk(&self, doc_id: &PointOffsetType, start_chunk: Option<usize>) -> Option<usize> {
    let start_chunk = if let Some(idx) = start_chunk { idx } else { 0 };
-   let start_chunk = if let Some(idx) = start_chunk { idx } else { 0 };
+   let start_chunk = start_chunk.unwrap_or_default();
or
-   let start_chunk = if let Some(idx) = start_chunk { idx } else { 0 };
+   let start_chunk = start_chunk.unwrap_or(0);
I chose the second option, fixed.
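For reference, the two suggested forms are equivalent for `usize` (whose `Default` is 0); `unwrap_or(0)` just states the fallback explicitly:

```rust
fn main() {
    let some: Option<usize> = Some(7);
    let none: Option<usize> = None;
    // `unwrap_or(0)` and `unwrap_or_default()` behave identically here,
    // since `usize::default()` is 0.
    assert_eq!(some.unwrap_or(0), 7);
    assert_eq!(none.unwrap_or(0), 0);
    assert_eq!(none.unwrap_or_default(), 0);
}
```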
Err(idx) => {
    if idx > 0 {
        Some(start_chunk + idx - 1)
    } else {
        None
    }
}
Not necessary, but we can have conditional branches:
-   Err(idx) => {
-       if idx > 0 {
-           Some(start_chunk + idx - 1)
-       } else {
-           None
-       }
-   }
+   Err(idx) if idx > 0 => Some(start_chunk + idx - 1),
+   Err(_) => None,
fixed
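The match-guard pattern above can be shown in a self-contained sketch (function and data are illustrative): `binary_search` returns `Err(idx)` with the would-be insertion point, so the candidate chunk is the one just before it, if any.

```rust
// Map a `binary_search` miss to the chunk whose range could contain `val`:
// `Err(idx)` means `val` would be inserted at `idx`, so the candidate chunk
// is the previous one; `Err(0)` means `val` precedes every chunk.
fn candidate_chunk(chunk_starts: &[u32], val: u32) -> Option<usize> {
    match chunk_starts.binary_search(&val) {
        Ok(idx) => Some(idx),
        Err(idx) if idx > 0 => Some(idx - 1),
        Err(_) => None,
    }
}

fn main() {
    let starts = [10u32, 20, 30];
    assert_eq!(candidate_chunk(&starts, 20), Some(1)); // exact chunk start
    assert_eq!(candidate_chunk(&starts, 25), Some(1)); // falls inside chunk 1
    assert_eq!(candidate_chunk(&starts, 5), None);     // before all chunks
}
```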
) {
    let chunk = &self.chunks[chunk_index];
    let chunk_size = Self::get_chunk_size(&self.chunks, &self.data, chunk_index);
    let chunk_bits = (chunk_size * 8) / BitPackerImpl::BLOCK_LEN;
Are we sure we never have any remainder here?
We are sure about it because `chunk_size` is defined as `chunk_bits * BitPackerImpl::BLOCK_LEN / 8`, and `BitPackerImpl::BLOCK_LEN` is 128.
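The no-remainder argument can be checked mechanically (a sketch; `BLOCK_LEN` mirrors `BitPackerImpl::BLOCK_LEN`): since the byte size is constructed as `chunk_bits * BLOCK_LEN / 8`, converting back to bits per value is always exact.

```rust
const BLOCK_LEN: usize = 128; // mirrors BitPackerImpl::BLOCK_LEN

// Byte size of one packed chunk, as defined in the discussion above.
fn chunk_size_bytes(chunk_bits: usize) -> usize {
    chunk_bits * BLOCK_LEN / 8
}

fn main() {
    for chunk_bits in 0..=32 {
        let chunk_size = chunk_size_bytes(chunk_bits);
        // The reverse computation never leaves a remainder...
        assert_eq!((chunk_size * 8) % BLOCK_LEN, 0);
        // ...and round-trips to the original bit width.
        assert_eq!(chunk_size * 8 / BLOCK_LEN, chunk_bits);
    }
}
```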
VmRSS on dev: 10_116 MB. RssAnon could be more relevant.
Added RssAnon measures to the PR description.
    };
let postings_opt: Option<Vec<_>> = query
    .tokens
    .iter()
    .map(|&vocab_idx| match vocab_idx {
        None => None,
        // unwrap safety: same as in filter()
-       Some(idx) => index_postings.get(idx as usize).unwrap().as_ref(),
+       Some(idx) => match &self {
`postings_opt` -> `posting_lengths`?
fixed
@@ -165,7 +180,7 @@ impl InvertedIndex {
        };
    }
    // Smallest posting is the largest possible cardinality
-   let smallest_posting = postings.iter().map(|posting| posting.len()).min().unwrap();
+   let smallest_posting = postings.iter().min().copied().unwrap();
Also, I think rust-analyzer has trouble inferring the `postings_opt` type now. It shows me `impl Iterator<Item = Option<...>>` before the collect, but where does the other layer of `Option` go, assuming that `postings` is `Vec<usize>`?
Added the `Vec<usize>` type annotation.
    return Self::default();
}
let len = posting_list.len();
let last_doc_id = *posting_list.list.last().unwrap();
Is it expected to be obtained before sorting?
Removed, thanks!

let last = *posting_list.list.last().unwrap();
while posting_list.list.len() % BitPackerImpl::BLOCK_LEN != 0 {
    posting_list.list.push(last);
I am not sure this is a good heuristic.
If we assume that the data is distributed according to Zipf's law, only the first few thousand vocab tokens will have enough occurrences, while the long tail of smaller posting lists will actually be bloated by this extra "alignment".
I propose making an enum of postings, which can decide on its own which implementation to choose.
Fixed. I implemented it more simply: the remainder data is stored uncompressed. For small posting lists, all data is stored as the remainder.
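The fix described above can be sketched (simplified, names illustrative): instead of padding the tail with the last doc id up to a 128-element boundary, split the sorted list into full blocks for bit-packing plus an uncompressed remainder.

```rust
const BLOCK_LEN: usize = 128; // mirrors BitPackerImpl::BLOCK_LEN

// Split a sorted posting list into full blocks (to be bit-packed) and an
// uncompressed remainder, avoiding any padding of the tail.
fn split_for_compression(ids: &[u32]) -> (&[u32], &[u32]) {
    let full = ids.len() / BLOCK_LEN * BLOCK_LEN;
    ids.split_at(full)
}

fn main() {
    let ids: Vec<u32> = (0..300).collect();
    let (blocks, remainder) = split_for_compression(&ids);
    assert_eq!(blocks.len(), 256);   // two full blocks of 128
    assert_eq!(remainder.len(), 44); // stored as-is, no padding overhead

    // A tiny posting list has no full blocks at all: everything is remainder.
    let small: Vec<u32> = (0..5).collect();
    let (b, r) = split_for_compression(&small);
    assert!(b.is_empty() && r.len() == 5);
}
```

This addresses the Zipf concern: long-tail posting lists shorter than 128 entries carry no alignment bloat at all.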
    }
}

pub fn contains(&mut self, val: &PointOffsetType) -> bool {
It looks like this method has undocumented "side effects". Maybe it is better to call it `contains_next` or something?
Also, a docstring might help.
I agree, the `&mut` tripped me as well on a `contains` method 👍
This method is defined in a separate helper structure which reuses decompressed data to avoid unnecessary decompression. The helper structure was commented. Added comments to this function as well, and renamed it to `contains_next`.
Needs refactoring
// Check if the next value is in the compressed posting list.
// This function reuses the decompressed chunk to avoid unnecessary decompression.
// It is useful when the visitor is used to check the values in increasing order.
pub fn contains_next(&mut self, val: &PointOffsetType) -> bool {
Could we extend the comment to explain why `&mut` is needed? I assume it's there to be careful about something specifically.
Answered the same question below: this method is defined in a separate helper structure which reuses decompressed data to avoid unnecessary decompression. The helper structure was commented. Added comments to this function as well, and renamed it to `contains_next`.
`contains_next` still doesn't highlight the need for mutability.
            .unwrap()
            .as_ref()
            .map(|p| p.len()),
    },
})
.collect();
I probably need an extra explanation of why we have a double `Option` around `PostingList`? And how is an `Option` of `p.len()` collected into just `Vec<usize>`? Do we skip `None`s? If we have no posting list for a token, shouldn't we consider it as length = 0?
Agree that these `Option`s are unclear. I think we can simplify the whole inverted index and remove a lot of obsolete `Option`s, but right now I'm not sure there is no place where an `Option` is necessary. I propose to fix it in a separate PR; created an issue for that:
#3716
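One part of the question has a purely mechanical answer worth sketching: collecting an iterator of `Option<T>` into `Option<Vec<T>>` short-circuits, so `None`s are not skipped. A single missing posting list makes the whole result `None` rather than silently dropping a length.

```rust
fn main() {
    // All tokens resolved: the lengths are gathered into Some(vec).
    let all_some = vec![Some(3usize), Some(1), Some(2)];
    let lengths: Option<Vec<usize>> = all_some.into_iter().collect();
    assert_eq!(lengths, Some(vec![3, 1, 2]));

    // One unresolved token (e.g. not in the vocabulary) poisons the whole
    // collection: the result is None, not a shorter Vec.
    let with_missing = vec![Some(3usize), None, Some(2)];
    let lengths: Option<Vec<usize>> = with_missing.into_iter().collect();
    assert_eq!(lengths, None);
}
```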
    self.find_in_decompressed_chunk(val)
}

fn find_in_decompressed_chunk(&mut self, val: &PointOffsetType) -> bool {
Same here, `find_in_decompressed_chunk` is a name for a read-only method.
It's not a method of the posting list; it's a method of a helper structure for intersections. It's mutable because we do the binary search not from 0 but from the index found by the previous `contains_next` check.
But I agree that it's incorrect naming. I have to mark in the function name that I change the internal state of the helper structure.
In a personal conversation @generall recommended using a naming pattern like `do_xxx_and_advance` for such cases. Renamed this method to `find_in_decompressed_and_advance`, and also renamed `contains_next` -> `contains_next_and_advance`.
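The stateful-helper idea discussed in this thread can be sketched as follows (a minimal stand-in, not the actual PR code): `&mut self` is needed because the checker remembers where the previous search ended and resumes the binary search from there, which is only valid when queries arrive in increasing order.

```rust
// Minimal sketch of the intersection helper: `contains_next_and_advance`
// takes `&mut self` because it caches the position where the previous
// search ended and starts the next binary search from there.
struct PostingChecker<'a> {
    ids: &'a [u32], // decompressed, sorted doc ids
    pos: usize,     // lower bound for the next search
}

impl<'a> PostingChecker<'a> {
    fn new(ids: &'a [u32]) -> Self {
        Self { ids, pos: 0 }
    }

    // Valid only when `val` arguments arrive in increasing order.
    fn contains_next_and_advance(&mut self, val: u32) -> bool {
        match self.ids[self.pos..].binary_search(&val) {
            Ok(i) => {
                self.pos += i + 1; // skip past the match for the next call
                true
            }
            Err(i) => {
                self.pos += i; // narrow the range even on a miss
                false
            }
        }
    }
}

fn main() {
    let mut checker = PostingChecker::new(&[2, 5, 9, 40]);
    assert!(checker.contains_next_and_advance(5));
    assert!(!checker.contains_next_and_advance(7));
    assert!(checker.contains_next_and_advance(9));
}
```

Narrowing the range on both hits and misses is what makes repeated membership checks over an increasing query stream cheaper than independent full binary searches.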
* compressed posting list definition (log compressed size, compress sorted, fix unit tests, flatten chunks, use visitor pattern to avoid multiple decompressions, check postings list boundaries, debug increasing order check, simplify ranges check, unit tests for visitor, remove debug size measurements, fix build)
* more comments
* review remarks
* are you happy clippy
* don't compress remainder
* review remarks
* rename methods
Compressing posting lists for the text field index.

Compression is implemented using the bitpacking crate: https://github.com/quickwit-oss/bitpacking. Compression works for the immutable state only; the mutable state uses the old posting list implementation.

Because bitpacking compresses only fixed-length chunks (128 `u32` values in this PR), the compressed posting list is divided into chunks, each chunk is compressed separately, and the compressed data is flattened into one large collection.

Because bitpacking provides only pack/unpack methods, we have to decompress a whole chunk if we want to check whether a value is present in the compressed posting list. For the intersection of posting lists there is a special helper visitor, which reuses decompressed chunks and shortens search ranges.
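The chunked layout described above can be sketched with a simplified, stdlib-only stand-in for the bitpacking crate (struct and function names are illustrative): each chunk records its first doc id and its byte offset into one flat data buffer, which is all that is needed to locate and decompress a single chunk.

```rust
const BLOCK_LEN: usize = 128; // bitpacking block length used in the PR

// Per-chunk metadata for the flattened layout. Real compression is done by
// the `bitpacking` crate; this sketch only models how chunks are addressed.
struct Chunk {
    initial: u32,  // first doc id of the chunk (base for delta decoding)
    offset: usize, // byte offset of the chunk's packed data in the flat buffer
}

// Assumes every chunk packs to the same byte size for simplicity; in the
// real index each chunk's size depends on its bit width.
fn build_chunks(ids: &[u32], bytes_per_chunk: usize) -> Vec<Chunk> {
    ids.chunks(BLOCK_LEN)
        .enumerate()
        .map(|(i, chunk)| Chunk {
            initial: chunk[0],
            offset: i * bytes_per_chunk,
        })
        .collect()
}

fn main() {
    let ids: Vec<u32> = (0..256).collect();
    let chunks = build_chunks(&ids, 64);
    assert_eq!(chunks.len(), 2);
    assert_eq!(chunks[1].initial, 128);
    assert_eq!(chunks[1].offset, 64);
}
```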
RAM difference was tested on https://storage.googleapis.com/common-datasets-snapshots/arxiv_abstracts-3083016565637815127-2023-06-02-07-26-29.snapshot
Measuring method:
/proc/<PID>/status
VmRSS on dev: 10_116 MB
VmRSS on branch: 9_696 MB
There is a 420 MB difference in VmRSS.
RssAnon on dev: 10_041 MB
RssAnon on branch: 9_621 MB
There is a 420 MB difference in RssAnon.
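The measurement described above can be reproduced with a one-liner (sketch; `/proc/self` reads the current shell's metrics, substitute the server's PID when measuring qdrant):

```shell
# Read resident-set metrics from /proc/<PID>/status.
# VmRSS is the total resident set; RssAnon excludes file-backed pages
# (e.g. mmapped files), so it better reflects heap usage.
grep -E '^(VmRSS|RssAnon):' /proc/self/status
```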
Performance difference
Performance was measured on the snapshot mentioned above.
As a measurement strategy, I called the filtered recommendation API 1000 times using the Rust client, with 2 different positive points, in single-thread mode. The call looks like:
For the small-cardinality case, `TEST_TEXT` is `Immediate benefits`. For the large-cardinality case, `TEST_TEXT` is `the a`. I checked with telemetry that these texts cover the small- and large-cardinality cases.
P95 search time for the `Immediate benefits` request:
P95 search time for the `The a` request:
There is a performance boost. Because such a measurement result is unexpected, I checked that all search results used for benchmarking are the same.
The performance boost can be explained for the small-cardinality case, where the visitor pattern shortens search ranges during posting list intersection. For the HNSW case I can only guess that doing a binary search in two small collections may take much less time than a binary search in one large collection; I have only this explanation.

All Submissions:
- Did you create your branch from `dev`? (Contributions should target the `dev` branch.)

New Feature Submissions:
- Did you run the `cargo +nightly fmt --all` command prior to submission?
- Does your code pass the `cargo clippy --all --all-features` command?

Changes to Core Features: