
Text index inverted index compression #3563

Merged: 7 commits merged into dev on Feb 28, 2024

Conversation

@IvanPleshkov (Contributor) commented Feb 8, 2024

Compressing the posting lists for the text field index.

Compression is implemented using the bitpacking crate: https://github.com/quickwit-oss/bitpacking. Compression works for the immutable state only; the mutable state uses the old posting list implementation.

Because bitpacking compresses only fixed-length chunks (128 u32 values in this PR), the compressed posting list is divided into chunks, each chunk is compressed separately, and the compressed data is flattened into one large collection.

Because bitpacking provides only pack/unpack methods, we have to decompress an entire chunk if we want to check whether a value is present in the compressed posting list. For the intersection of posting lists there is a special helper visitor that reuses decompressed chunks and shortens search ranges.
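A minimal sketch of the chunking idea described above (hypothetical names; the real implementation bit-packs 128-value blocks with the bitpacking crate, while here the "compression" is plain delta encoding to keep the example self-contained and BLOCK_LEN is shrunk to 4):

```rust
const BLOCK_LEN: usize = 4; // the PR uses 128, bitpacking's u32 block size

// One chunk stores the first doc id plus a "compressed" body.
struct Chunk {
    initial: u32,
    deltas: Vec<u32>, // stand-in for the bit-packed bytes
}

fn compress(sorted_ids: &[u32]) -> Vec<Chunk> {
    sorted_ids
        .chunks(BLOCK_LEN)
        .map(|block| Chunk {
            initial: block[0],
            deltas: block.windows(2).map(|w| w[1] - w[0]).collect(),
        })
        .collect()
}

fn decompress(chunk: &Chunk) -> Vec<u32> {
    let mut out = vec![chunk.initial];
    for d in &chunk.deltas {
        out.push(out.last().unwrap() + d);
    }
    out
}

fn main() {
    let ids: Vec<u32> = vec![3, 7, 8, 20, 21, 30];
    let chunks = compress(&ids);
    // Membership check decompresses only the one chunk that may hold the value:
    let val = 21;
    let idx = match chunks.binary_search_by_key(&val, |c| c.initial) {
        Ok(i) => i,
        Err(0) => usize::MAX, // smaller than every stored id
        Err(i) => i - 1,
    };
    assert!(idx != usize::MAX && decompress(&chunks[idx]).contains(&val));
}
```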

RAM difference

RAM difference was tested on https://storage.googleapis.com/common-datasets-snapshots/arxiv_abstracts-3083016565637815127-2023-06-02-07-26-29.snapshot

Measuring method: /proc/<PID>/status

VmRSS on dev: 10_116 MB
VmRSS on branch: 9_696 MB
There is 420MB difference in VmRSS.

RssAnon on dev: 10_041 MB
RssAnon on branch: 9_621 MB
There is 420MB difference in RssAnon.
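The figures above can be reproduced against the running server with a one-liner (Linux only; the `qdrant` process name is an assumption, substitute whatever `pidof` should resolve in your setup):

```shell
# Reads our own status as a stand-in; replace "self" with the server PID,
# e.g. $(pidof qdrant), to measure the actual process.
grep -E '^(VmRSS|RssAnon):' /proc/self/status
```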

Performance difference

Performance was measured on the snapshot mentioned above.
As a measurement strategy, I used the Rust client to call the filtered recommendation API 1000 times with 2 different positive points in single-threaded mode. The call looks like:

curl -X POST "http://$QDRANT_HOST/collections/text/points/recommend" \
  -H 'Content-Type: application/json' \
  --data-raw '{
      "positive": [ "000004ce-7a38-478f-8d83-2900a72e1d8d", "000004ce-7a38-478f-8d83-6700a12e173t" ],
      "limit": 10,
      "filter": {
          "should": [
              {
                  "key": "abstract",
                  "match": {
                      "text": "<TEST_TEXT>"
                  }
              }
          ]
      },
      "with_payload": false,
      "with_vector": false
    }' | jq

For the small-cardinality case, TEST_TEXT is "Immediate benefits".
For the large-cardinality case, TEST_TEXT is "the a".
I checked with telemetry that these texts cover the small- and large-cardinality cases.

P95 search time for the "Immediate benefits" request:

DEV: 0.000173356
PR: 0.000119321

P95 search time for the "the a" request:

DEV: 0.030939524
PR: 0.02868633

There is a performance boost. Because such a result was unexpected, I checked that all search results used for benchmarking are the same.
The boost can be explained for the small-cardinality case, where the visitor pattern shortens search ranges during postings intersection. For the HNSW case I can only guess that doing a binary search in two small collections may take much less time than a binary search in one large collection; I have no other explanation.
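The "shortened search ranges" effect can be illustrated on plain sorted slices (a hypothetical sketch, not the PR's code): when candidate values arrive in increasing order, each binary search can start where the previous one left off, so every lookup scans a strictly shorter suffix.

```rust
// Checker that remembers the last search position so successive lookups
// in increasing order never re-scan the prefix of the posting list.
struct AdvancingChecker<'a> {
    posting: &'a [u32],
    start: usize, // lower bound for the next search
}

impl<'a> AdvancingChecker<'a> {
    fn new(posting: &'a [u32]) -> Self {
        Self { posting, start: 0 }
    }

    // Valid only when values are queried in increasing order.
    fn contains_next(&mut self, val: u32) -> bool {
        match self.posting[self.start..].binary_search(&val) {
            Ok(i) => {
                self.start += i;
                true
            }
            Err(i) => {
                self.start += i;
                false
            }
        }
    }
}

fn main() {
    let posting = [2u32, 5, 9, 14, 20];
    let mut checker = AdvancingChecker::new(&posting);
    assert!(checker.contains_next(5));
    assert!(!checker.contains_next(10));
    assert!(checker.contains_next(14));
    // Each call searched only the suffix after the previous position.
    assert_eq!(checker.start, 3);
}
```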

All Submissions:

  • Contributions should target the dev branch. Did you create your branch from dev?
  • Have you followed the guidelines in our Contributing document?
  • Have you checked to ensure there aren't other open Pull Requests for the same update/change?

New Feature Submissions:

  1. Does your submission pass tests?
  2. Have you formatted your code locally using cargo +nightly fmt --all command prior to submission?
  3. Have you checked your code using cargo clippy --all --all-features command?

Changes to Core Features:

  • Have you added an explanation of what your changes do and why you'd like us to include them?
  • Have you written new tests for your core changes, as applicable?
  • Have you successfully run tests with your changes locally?

@IvanPleshkov IvanPleshkov force-pushed the text-index-immutable-state-without-documents branch 2 times, most recently from fa25bd0 to 1b1db47 Compare February 14, 2024 10:30
Base automatically changed from text-index-immutable-state-without-documents to dev February 14, 2024 11:09
@IvanPleshkov IvanPleshkov force-pushed the text-index-inverted-index-compression branch from cafcb7e to 5aae0b8 Compare February 14, 2024 14:32
log compressed size

compress sorted

fix unit tests

flatten chunks

use visitor pattern to avoid multiple decompressions

check postings list boundaries

debug increasing order check

simplify ranges check

unit tests for visitor

remove debug size measurements

fix build
@IvanPleshkov IvanPleshkov force-pushed the text-index-inverted-index-compression branch from 5aae0b8 to 466cefc Compare February 20, 2024 08:37
@agourlay (Member) commented:

Do you have an end-to-end test to see if the search latency is impacted by the compression overhead?

@IvanPleshkov (Contributor Author) commented Feb 21, 2024

> Do you have an end-to-end test to see if the search latency is impacted by the compression overhead?

I added a section with performance measurement results to the description.

@@ -165,7 +180,7 @@ impl InvertedIndex {
};
}
// Smallest posting is the largest possible cardinality
let smallest_posting = postings.iter().map(|posting| posting.len()).min().unwrap();
let smallest_posting = postings.iter().cloned().min().unwrap();
Member: Can we call cloned() after min() only?

Member: Would using copied() be possible here? That would be a lot cheaper.

Contributor Author: Fixed.

.into_iter()
.map(|x| x.map(CompressedPostingList::new))
.collect();
postings.shrink_to_fit();
Member: Why is shrink_to_fit necessary? I'd assume the collect method would size the vector appropriately.

Member: Yes, assuming the size information of the iterator is exact, it is unnecessary.

Contributor Author: Fixed.

}

#[derive(Default)]
pub struct ImmutableInvertedIndex {
postings: Vec<Option<PostingList>>,
postings: Vec<Option<CompressedPostingList>>,
Member: Are we saving the ImmutableInvertedIndex to disk with the compressed posting lists at some point?

Contributor Author: No, we build it from RocksDB while loading.

Comment on lines 52 to 53
data: Box<[u8]>,
chunks: Box<[CompressedPostingChunk]>,
Member: Can't we use a vector in these two cases instead, rather than a boxed array?

Contributor Author: I used a boxed array to show that the size cannot change, but in general I don't have any objections. Fixed.

}

fn find_chunk(&self, doc_id: &PointOffsetType, start_chunk: Option<usize>) -> Option<usize> {
let start_chunk = if let Some(idx) = start_chunk { idx } else { 0 };
Member:
Suggested change
let start_chunk = if let Some(idx) = start_chunk { idx } else { 0 };
let start_chunk = start_chunk.unwrap_or_default();

or

Suggested change
let start_chunk = if let Some(idx) = start_chunk { idx } else { 0 };
let start_chunk = start_chunk.unwrap_or(0);

Contributor Author: I chose the second option, fixed.

Comment on lines 170 to 176
Err(idx) => {
if idx > 0 {
Some(start_chunk + idx - 1)
} else {
None
}
}
Member:

Not necessary, but we can have conditional branches:

Suggested change
Err(idx) => {
if idx > 0 {
Some(start_chunk + idx - 1)
} else {
None
}
}
Err(idx) if idx > 0 => Some(start_chunk + idx - 1),
Err(_) => None,

Contributor Author: Fixed.

) {
let chunk = &self.chunks[chunk_index];
let chunk_size = Self::get_chunk_size(&self.chunks, &self.data, chunk_index);
let chunk_bits = (chunk_size * 8) / BitPackerImpl::BLOCK_LEN;
Member: Are we sure we never have any remainder here?

Contributor Author: We are sure, because chunk_size is defined as chunk_bits * BitPackerImpl::BLOCK_LEN / 8, and BitPackerImpl::BLOCK_LEN is 128.
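The no-remainder invariant can be checked directly: since chunk_size is derived as chunk_bits * BLOCK_LEN / 8 with BLOCK_LEN = 128, multiplying back by 8 always yields an exact multiple of BLOCK_LEN (a quick sketch using the constants from the discussion):

```rust
const BLOCK_LEN: usize = 128; // BitPackerImpl::BLOCK_LEN for u32 blocks

fn main() {
    // num_bits reported by the bit packer can be anywhere in 0..=32 for u32.
    for chunk_bits in 0..=32usize {
        let chunk_size = chunk_bits * BLOCK_LEN / 8; // how chunk_size is derived
        // Recovering the bit width never loses a remainder:
        assert_eq!((chunk_size * 8) % BLOCK_LEN, 0);
        assert_eq!((chunk_size * 8) / BLOCK_LEN, chunk_bits);
    }
}
```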

@generall (Member) commented:

> VmRSS on dev: 10_116 MB
> VmRSS on branch: 9_696 MB

RssAnon could be more relevant.

@IvanPleshkov (Contributor Author) commented:

> RssAnon could be more relevant

Added RssAnon measures to the PR description.

};
let postings_opt: Option<Vec<_>> = query
.tokens
.iter()
.map(|&vocab_idx| match vocab_idx {
None => None,
// unwrap safety: same as in filter()
Some(idx) => index_postings.get(idx as usize).unwrap().as_ref(),
Some(idx) => match &self {
Member: postings_opt -> posting_lengths?

Contributor Author: Fixed.

@@ -165,7 +180,7 @@ impl InvertedIndex {
};
}
// Smallest posting is the largest possible cardinality
let smallest_posting = postings.iter().map(|posting| posting.len()).min().unwrap();
let smallest_posting = postings.iter().min().copied().unwrap();
Member: Also, I think rust-analyzer has trouble inferring the postings_opt type now. It shows me impl Iterator<Item = Option<...>> before the collect, but where does the other layer of Option go, assuming that postings is Vec<usize>?

Contributor Author: Added the Vec<usize> type annotation.

return Self::default();
}
let len = posting_list.len();
let last_doc_id = *posting_list.list.last().unwrap();
Member: Is it expected to be obtained before sorting?

Contributor Author: Removed, thanks!


let last = *posting_list.list.last().unwrap();
while posting_list.list.len() % BitPackerImpl::BLOCK_LEN != 0 {
posting_list.list.push(last);
Member: I am not sure this is a good heuristic.
If we assume that the data is distributed according to Zipf's law, only the first few thousand vocab tokens will have enough occurrences, while the long tail of smaller posting lists will actually be bloated with this extra "alignment".

I propose to make an enum of postings, which can decide for itself which implementation to choose.

Contributor Author: Fixed, with a simpler implementation: the remainder data is stored uncompressed, so a small posting list keeps all of its data as the remainder.
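A sketch of the resulting layout (hypothetical names, BLOCK_LEN shrunk to 4, and block "compression" elided): only full BLOCK_LEN blocks go through the bit packer, while the trailing remainder stays as plain u32s, so short posting lists pay no alignment bloat.

```rust
const BLOCK_LEN: usize = 4; // the real code uses 128

struct CompressedPosting {
    compressed_blocks: Vec<Vec<u32>>, // stand-in for bit-packed chunks
    remainder: Vec<u32>,              // stored as-is, no padding
}

fn build(ids: &[u32]) -> CompressedPosting {
    let full = ids.len() / BLOCK_LEN * BLOCK_LEN; // largest aligned prefix
    CompressedPosting {
        compressed_blocks: ids[..full].chunks(BLOCK_LEN).map(|c| c.to_vec()).collect(),
        remainder: ids[full..].to_vec(),
    }
}

fn main() {
    // A small posting list lives entirely in the remainder:
    let small = build(&[1, 2, 3]);
    assert!(small.compressed_blocks.is_empty());
    assert_eq!(small.remainder, vec![1, 2, 3]);

    // A longer list compresses the full blocks and keeps the tail plain:
    let long = build(&[1, 2, 3, 4, 5, 6]);
    assert_eq!(long.compressed_blocks.len(), 1);
    assert_eq!(long.remainder, vec![5, 6]);
}
```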

}
}

pub fn contains(&mut self, val: &PointOffsetType) -> bool {
Member: It looks like this method has undocumented "side effects". Maybe it is better to call it contains_next or something?

Member: Also, a docstring might help.

Member: I agree, the &mut tripped me up as well on a contains method 👍

Contributor Author: This method is defined in a separate helper structure which reuses decompressed data to avoid unnecessary decompression. The helper structure was already commented; I added comments to this function as well, and renamed it to contains_next.

@generall (Member) left a review: Needs refactoring

Comment on lines 244 to 247
// Check if the next value is in the compressed posting list.
// This function reuses the decompressed chunk to avoid unnecessary decompression.
// It is useful when the visitor is used to check the values in the increasing order.
pub fn contains_next(&mut self, val: &PointOffsetType) -> bool {
Member:

Could we extend the comment to explain why &mut is needed?

I assume it's there to be careful about something specifically.

Contributor Author: I answered the same question below. This method is defined in a separate helper structure which reuses decompressed data to avoid unnecessary decompression. The helper structure was already commented; I added comments to this function as well, and renamed it to contains_next.

Member:

contains_next still doesn't highlight the need for mutability.

.unwrap()
.as_ref()
.map(|p| p.len()),
},
})
.collect();
Member: I probably need an extra explanation of why we have a double Option around PostingList.

And how is an Option of p.len() collected into just Vec<usize>? Do we skip Nones? If we have no posting list for a token, shouldn't we consider it as length = 0?

Contributor Author: I agree that these Options are unclear. I think we can simplify the whole inverted index and remove a lot of obsolete Options, but right now I'm not sure there is no place where an Option is still necessary. I propose to fix it in a separate PR; I created an issue for that:
#3716

self.find_in_decompressed_chunk(val)
}

fn find_in_decompressed_chunk(&mut self, val: &PointOffsetType) -> bool {
Member: Same here, find_in_decompressed_chunk is a name for a read-only method.

Contributor Author: It's not a method of the posting list; it's a method of the helper structure for intersections. It's mutable because the binary search inside does not start from 0: it starts from the index found by the previous contains_next check.

Contributor Author: But I agree that the naming is incorrect. I should mark in the function name that it changes the internal state of the helper structure.

Contributor Author: In a personal conversation @generall recommended using a naming pattern like do_xxx_and_advance for such cases. I renamed this method to find_in_decompressed_and_advance, and also renamed contains_next to contains_next_and_advance.

@IvanPleshkov IvanPleshkov merged commit 7b4469d into dev Feb 28, 2024
17 checks passed
@IvanPleshkov IvanPleshkov deleted the text-index-inverted-index-compression branch February 28, 2024 12:47
timvisee pushed a commit that referenced this pull request Mar 5, 2024
* compressed posting list definition

log compressed size

compress sorted

fix unit tests

flatten chunks

use visitor pattern to avoid multiple decompressions

check postings list boundaries

debug increasing order check

simplify ranges check

unit tests for visitor

remove debug size measurements

fix build

* more comments

* review remarks

* are you happy clippy

* don't compress remainder

* review remarks

* rename methods