Text index inverted index compression #3563
Conversation
Force-pushed fa25bd0 to 1b1db47
Force-pushed cafcb7e to 5aae0b8
Force-pushed 5aae0b8 to 466cefc (log compressed size, compress sorted, fix unit tests, flatten chunks, use visitor pattern to avoid multiple decompressions, check postings list boundaries, debug increasing order check, simplify ranges check, unit tests for visitor, remove debug size measurements, fix build)
Do you have an end-to-end test to see if the search latency is impacted by the compression overhead?
I added a section with performance measurement results in the description.
@@ -165,7 +180,7 @@ impl InvertedIndex {
        };
    }
    // Smallest posting is the largest possible cardinality
-   let smallest_posting = postings.iter().map(|posting| posting.len()).min().unwrap();
+   let smallest_posting = postings.iter().cloned().min().unwrap();
Can we call `cloned()` only after `min()`?
Would using `copied()` be possible here? That would be a lot cheaper.
fixed
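The reviewers' suggestion can be illustrated with a small sketch (names are illustrative, not the actual index code): `iter()` yields references, so moving `copied()` after `min()` converts only the single winning element instead of every element.

```rust
// `iter()` yields `&usize`; `min()` compares references without copying.
// Calling `copied()` on the `Option<&usize>` result converts at most one
// value, instead of one per element as `map(|p| p.len())` or eager cloning would.
fn smallest(postings: &[usize]) -> Option<usize> {
    postings.iter().min().copied()
}

fn main() {
    assert_eq!(smallest(&[5, 2, 9]), Some(2));
    assert_eq!(smallest(&[]), None);
}
```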
        .into_iter()
        .map(|x| x.map(CompressedPostingList::new))
        .collect();
    postings.shrink_to_fit();
Why is `shrink_to_fit` necessary? I'd assume the `collect` method would size the vector appropriately.
Yes, assuming the iterator's size information is exact, it is unnecessary.
fixed
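A minimal sketch of the point being made: mapping over a `Vec` iterator preserves the exact size hint, so `collect` can allocate the right capacity up front (this is the usual behavior for exact-size iterators, not a hard documented guarantee).

```rust
fn main() {
    let src = vec![1u32, 2, 3, 4];
    // `map` over a Vec iterator keeps the exact size hint, so `collect`
    // allocates appropriately; a following `shrink_to_fit` is then a no-op.
    let doubled: Vec<u32> = src.into_iter().map(|x| x * 2).collect();
    assert_eq!(doubled, vec![2, 4, 6, 8]);
    // Capacity is at least len; in practice it matches for exact-size iterators.
    assert!(doubled.capacity() >= doubled.len());
}
```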
}

#[derive(Default)]
pub struct ImmutableInvertedIndex {
-   postings: Vec<Option<PostingList>>,
+   postings: Vec<Option<CompressedPostingList>>,
Are we saving the `ImmutableInvertedIndex` to disk with the compressed posting lists at some point?
No, we build it from RocksDB while loading.
data: Box<[u8]>,
chunks: Box<[CompressedPostingChunk]>,
Can't we use a vector in these two cases instead, rather than a boxed array?
I used a boxed array to signal that the size cannot change. But in general I don't have any objections. Fixed.
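The trade-off discussed above can be sketched: a boxed slice drops the capacity field and communicates a fixed size, while converting back to `Vec` remains cheap if growth is needed later.

```rust
fn main() {
    let v: Vec<u8> = vec![1, 2, 3];
    // `into_boxed_slice` shrinks to fit and discards the capacity field,
    // signaling that the collection will not grow.
    let b: Box<[u8]> = v.into_boxed_slice();
    assert_eq!(b.len(), 3);
    // Converting back is an O(1) move if growth becomes necessary.
    let v2: Vec<u8> = b.into_vec();
    assert_eq!(v2, vec![1, 2, 3]);
}
```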
}

fn find_chunk(&self, doc_id: &PointOffsetType, start_chunk: Option<usize>) -> Option<usize> {
    let start_chunk = if let Some(idx) = start_chunk { idx } else { 0 };
-   let start_chunk = if let Some(idx) = start_chunk { idx } else { 0 };
+   let start_chunk = start_chunk.unwrap_or_default();
or
-   let start_chunk = if let Some(idx) = start_chunk { idx } else { 0 };
+   let start_chunk = start_chunk.unwrap_or(0);
I chose the second option, fixed.
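For reference, the two suggested forms are equivalent for `usize` (whose `Default` is 0); `unwrap_or(0)` just states the fallback explicitly:

```rust
fn main() {
    let some: Option<usize> = Some(7);
    let none: Option<usize> = None;
    // `unwrap_or(0)` and `unwrap_or_default()` behave identically here,
    // since `usize::default()` is 0.
    assert_eq!(some.unwrap_or(0), 7);
    assert_eq!(none.unwrap_or(0), 0);
    assert_eq!(none.unwrap_or_default(), 0);
}
```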
Err(idx) => {
    if idx > 0 {
        Some(start_chunk + idx - 1)
    } else {
        None
    }
}
Not necessary, but we can have conditional branches:
-   Err(idx) => {
-       if idx > 0 {
-           Some(start_chunk + idx - 1)
-       } else {
-           None
-       }
-   }
+   Err(idx) if idx > 0 => Some(start_chunk + idx - 1),
+   Err(_) => None,
fixed
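The match-guard pattern above can be shown in a self-contained sketch (function and data are illustrative): `binary_search` returns `Err(idx)` with the would-be insertion point, so the candidate chunk is the one just before it, if any.

```rust
// Map a `binary_search` miss to the chunk whose range could contain `val`:
// `Err(idx)` means `val` would be inserted at `idx`, so the candidate chunk
// is the previous one; `Err(0)` means `val` precedes every chunk.
fn candidate_chunk(chunk_starts: &[u32], val: u32) -> Option<usize> {
    match chunk_starts.binary_search(&val) {
        Ok(idx) => Some(idx),
        Err(idx) if idx > 0 => Some(idx - 1),
        Err(_) => None,
    }
}

fn main() {
    let starts = [10u32, 20, 30];
    assert_eq!(candidate_chunk(&starts, 20), Some(1)); // exact chunk start
    assert_eq!(candidate_chunk(&starts, 25), Some(1)); // falls inside chunk 1
    assert_eq!(candidate_chunk(&starts, 5), None);     // before all chunks
}
```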
) {
    let chunk = &self.chunks[chunk_index];
    let chunk_size = Self::get_chunk_size(&self.chunks, &self.data, chunk_index);
    let chunk_bits = (chunk_size * 8) / BitPackerImpl::BLOCK_LEN;
Are we sure we never have any remainder here?
We are sure about it because `chunk_size` is defined as `chunk_bits * BitPackerImpl::BLOCK_LEN / 8`, and `BitPackerImpl::BLOCK_LEN` is 128.
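The no-remainder argument can be checked mechanically (a sketch; `BLOCK_LEN` mirrors `BitPackerImpl::BLOCK_LEN`): since the byte size is constructed as `chunk_bits * BLOCK_LEN / 8`, converting back to bits per value is always exact.

```rust
const BLOCK_LEN: usize = 128; // mirrors BitPackerImpl::BLOCK_LEN

// Byte size of one packed chunk, as defined in the discussion above.
fn chunk_size_bytes(chunk_bits: usize) -> usize {
    chunk_bits * BLOCK_LEN / 8
}

fn main() {
    for chunk_bits in 0..=32 {
        let chunk_size = chunk_size_bytes(chunk_bits);
        // The reverse computation never leaves a remainder...
        assert_eq!((chunk_size * 8) % BLOCK_LEN, 0);
        // ...and round-trips to the original bit width.
        assert_eq!(chunk_size * 8 / BLOCK_LEN, chunk_bits);
    }
}
```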
VmRSS on dev: 10_116 MB. RssAnon could be more relevant.
Added RssAnon measures to the PR description.
    };
let postings_opt: Option<Vec<_>> = query
    .tokens
    .iter()
    .map(|&vocab_idx| match vocab_idx {
        None => None,
        // unwrap safety: same as in filter()
-       Some(idx) => index_postings.get(idx as usize).unwrap().as_ref(),
+       Some(idx) => match &self {
`postings_opt` -> `posting_lengths`?
fixed
@@ -165,7 +180,7 @@ impl InvertedIndex {
        };
    }
    // Smallest posting is the largest possible cardinality
-   let smallest_posting = postings.iter().map(|posting| posting.len()).min().unwrap();
+   let smallest_posting = postings.iter().min().copied().unwrap();
Also, I think rust-analyzer has trouble inferring the `postings_opt` type now. It shows me `impl Iterator<Item = Option<...>>` before the collect, but where does the other layer of `Option` go, assuming that `postings` is `Vec<usize>`?
Added the `Vec<usize>` type annotation.
    return Self::default();
}
let len = posting_list.len();
let last_doc_id = *posting_list.list.last().unwrap();
Is it expected to be obtained before sorting?
Removed, thanks!

let last = *posting_list.list.last().unwrap();
while posting_list.list.len() % BitPackerImpl::BLOCK_LEN != 0 {
    posting_list.list.push(last);
I am not sure this is a good heuristic.
If we assume that the data is distributed according to Zipf's law, only the first few thousand vocab tokens will have enough occurrences, while the long tail of smaller posting lists will actually be bloated by this extra "alignment".
I propose making an enum of postings, which can decide on its own which implementation to choose.
Fixed. I implemented it more simply: the remainder data is stored uncompressed. For small posting lists, all data is stored as the remainder.
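The fix described above can be sketched (simplified, names illustrative): instead of padding the tail with the last doc id up to a 128-element boundary, split the sorted list into full blocks for bit-packing plus an uncompressed remainder.

```rust
const BLOCK_LEN: usize = 128; // mirrors BitPackerImpl::BLOCK_LEN

// Split a sorted posting list into full blocks (to be bit-packed) and an
// uncompressed remainder, avoiding any padding of the tail.
fn split_for_compression(ids: &[u32]) -> (&[u32], &[u32]) {
    let full = ids.len() / BLOCK_LEN * BLOCK_LEN;
    ids.split_at(full)
}

fn main() {
    let ids: Vec<u32> = (0..300).collect();
    let (blocks, remainder) = split_for_compression(&ids);
    assert_eq!(blocks.len(), 256);   // two full blocks of 128
    assert_eq!(remainder.len(), 44); // stored as-is, no padding overhead

    // A tiny posting list has no full blocks at all: everything is remainder.
    let small: Vec<u32> = (0..5).collect();
    let (b, r) = split_for_compression(&small);
    assert!(b.is_empty() && r.len() == 5);
}
```

This addresses the Zipf concern: long-tail posting lists shorter than 128 entries carry no alignment bloat at all.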
    }
}

pub fn contains(&mut self, val: &PointOffsetType) -> bool {
It looks like this method has undocumented "side effects". Maybe it is better to call it `contains_next` or something?
Also, a docstring might help.
I agree, the `&mut` tripped me as well on a `contains` method 👍
This method is defined in a separate helper structure which reuses decompressed data to avoid unnecessary decompression. The helper structure was commented. Added comments to this function as well, and renamed it to `contains_next`.
Needs refactoring
// Check if the next value is in the compressed posting list.
// This function reuses the decompressed chunk to avoid unnecessary decompression.
// It is useful when the visitor is used to check the values in increasing order.
pub fn contains_next(&mut self, val: &PointOffsetType) -> bool {
Could we extend the comment to explain why `&mut` is needed? I assume it's there to be careful about something specifically.
Answered the same question below: this method is defined in a separate helper structure which reuses decompressed data to avoid unnecessary decompression. The helper structure was commented. Added comments to this function as well, and renamed it to `contains_next`.
`contains_next` still doesn't highlight the need for mutability.
            .unwrap()
            .as_ref()
            .map(|p| p.len()),
    },
})
.collect();
I probably need an extra explanation of why we have a double `Option` around `PostingList`? And how is an `Option` of `p.len()` collected into just `Vec<usize>`? Do we skip `None`s? If we have no posting list for a token, shouldn't we consider it as length = 0?
Agree that these `Option`s are unclear. I think we can simplify the whole inverted index and remove a lot of obsolete `Option`s, but right now I'm not sure there is no place where an `Option` is necessary. I propose to fix it in a separate PR; created an issue for that:
#3716
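One part of the question has a purely mechanical answer worth sketching: collecting an iterator of `Option<T>` into `Option<Vec<T>>` short-circuits, so `None`s are not skipped. A single missing posting list makes the whole result `None` rather than silently dropping a length.

```rust
fn main() {
    // All tokens resolved: the lengths are gathered into Some(vec).
    let all_some = vec![Some(3usize), Some(1), Some(2)];
    let lengths: Option<Vec<usize>> = all_some.into_iter().collect();
    assert_eq!(lengths, Some(vec![3, 1, 2]));

    // One unresolved token (e.g. not in the vocabulary) poisons the whole
    // collection: the result is None, not a shorter Vec.
    let with_missing = vec![Some(3usize), None, Some(2)];
    let lengths: Option<Vec<usize>> = with_missing.into_iter().collect();
    assert_eq!(lengths, None);
}
```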
    self.find_in_decompressed_chunk(val)
}

fn find_in_decompressed_chunk(&mut self, val: &PointOffsetType) -> bool {
Same here, `find_in_decompressed_chunk` is a name for a read-only method.
It's not a method of the posting list; it's a method of a helper structure for intersections. It's mutable because we do the binary search not from 0 but from the index found by the previous `contains_next` check.
But I agree that it's incorrect naming. I have to mark in the function name that I change the internal state of the helper structure.
In a personal conversation @generall recommended using a naming pattern like `do_xxx_and_advance` for such cases. Renamed this method to `find_in_decompressed_and_advance`, and also renamed `contains_next` -> `contains_next_and_advance`.
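The stateful-helper idea discussed in this thread can be sketched as follows (a minimal stand-in, not the actual PR code): `&mut self` is needed because the checker remembers where the previous search ended and resumes the binary search from there, which is only valid when queries arrive in increasing order.

```rust
// Minimal sketch of the intersection helper: `contains_next_and_advance`
// takes `&mut self` because it caches the position where the previous
// search ended and starts the next binary search from there.
struct PostingChecker<'a> {
    ids: &'a [u32], // decompressed, sorted doc ids
    pos: usize,     // lower bound for the next search
}

impl<'a> PostingChecker<'a> {
    fn new(ids: &'a [u32]) -> Self {
        Self { ids, pos: 0 }
    }

    // Valid only when `val` arguments arrive in increasing order.
    fn contains_next_and_advance(&mut self, val: u32) -> bool {
        match self.ids[self.pos..].binary_search(&val) {
            Ok(i) => {
                self.pos += i + 1; // skip past the match for the next call
                true
            }
            Err(i) => {
                self.pos += i; // narrow the range even on a miss
                false
            }
        }
    }
}

fn main() {
    let mut checker = PostingChecker::new(&[2, 5, 9, 40]);
    assert!(checker.contains_next_and_advance(5));
    assert!(!checker.contains_next_and_advance(7));
    assert!(checker.contains_next_and_advance(9));
}
```

Narrowing the range on both hits and misses is what makes repeated membership checks over an increasing query stream cheaper than independent full binary searches.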
* compressed posting list definition (log compressed size, compress sorted, fix unit tests, flatten chunks, use visitor pattern to avoid multiple decompressions, check postings list boundaries, debug increasing order check, simplify ranges check, unit tests for visitor, remove debug size measurements, fix build)
* more comments
* review remarks
* are you happy clippy
* don't compress remainder
* review remarks
* rename methods
Compressing posting lists for the text field index.

Compression is implemented using the bitpacking crate: https://github.com/quickwit-oss/bitpacking. Compression works for the immutable state only; the mutable state uses the old posting list implementation.

Because bitpacking compresses only fixed-length chunks (128 `u32` values in this PR), the compressed posting list is divided into chunks, each chunk is compressed separately, and the compressed data is flattened into one large collection.

Because bitpacking provides only pack/unpack methods, we have to decompress a whole chunk if we want to check whether a value is present in the compressed posting list. For the intersection of posting lists there is a special helper visitor, which reuses decompressed chunks and shortens search ranges.
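The chunked layout described above can be sketched with a simplified, stdlib-only stand-in for the bitpacking crate (struct and function names are illustrative): each chunk records its first doc id and its byte offset into one flat data buffer, which is all that is needed to locate and decompress a single chunk.

```rust
const BLOCK_LEN: usize = 128; // bitpacking block length used in the PR

// Per-chunk metadata for the flattened layout. Real compression is done by
// the `bitpacking` crate; this sketch only models how chunks are addressed.
struct Chunk {
    initial: u32,  // first doc id of the chunk (base for delta decoding)
    offset: usize, // byte offset of the chunk's packed data in the flat buffer
}

// Assumes every chunk packs to the same byte size for simplicity; in the
// real index each chunk's size depends on its bit width.
fn build_chunks(ids: &[u32], bytes_per_chunk: usize) -> Vec<Chunk> {
    ids.chunks(BLOCK_LEN)
        .enumerate()
        .map(|(i, chunk)| Chunk {
            initial: chunk[0],
            offset: i * bytes_per_chunk,
        })
        .collect()
}

fn main() {
    let ids: Vec<u32> = (0..256).collect();
    let chunks = build_chunks(&ids, 64);
    assert_eq!(chunks.len(), 2);
    assert_eq!(chunks[1].initial, 128);
    assert_eq!(chunks[1].offset, 64);
}
```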
RAM difference was tested on https://storage.googleapis.com/common-datasets-snapshots/arxiv_abstracts-3083016565637815127-2023-06-02-07-26-29.snapshot
Measuring method:
/proc/<PID>/status
VmRSS on dev: 10_116 MB
VmRSS on branch: 9_696 MB
There is a 420 MB difference in VmRSS.
RssAnon on dev: 10_041 MB
RssAnon on branch: 9_621 MB
There is a 420 MB difference in RssAnon.
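The measurement described above can be reproduced with a one-liner (sketch; `/proc/self` reads the current shell's metrics, substitute the server's PID when measuring qdrant):

```shell
# Read resident-set metrics from /proc/<PID>/status.
# VmRSS is the total resident set; RssAnon excludes file-backed pages
# (e.g. mmapped files), so it better reflects heap usage.
grep -E '^(VmRSS|RssAnon):' /proc/self/status
```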
Performance difference
Performance was measured on the snapshot mentioned above.
As a measurement strategy, I called the filtered recommendation API 1000 times using the Rust client, with 2 different positive points, in single-thread mode. The call looks like:
For the small-cardinality case, `TEST_TEXT` is `Immediate benefits`. For the large-cardinality case, `TEST_TEXT` is `the a`. I checked with telemetry that these texts cover the small- and large-cardinality cases.
P95 search time for the `Immediate benefits` request:
P95 search time for the `The a` request:
There is a performance boost. Because such a measurement result is unexpected, I checked that all search results used for benchmarking are the same.
The performance boost can be explained for the small-cardinality case, where the visitor pattern shortens search ranges during posting list intersection. For the HNSW case I can only guess that doing a binary search in two small collections may take much less time than a binary search in one large collection; I have only this explanation.

All Submissions:
- Did you create your branch from `dev`? (Contributions should target the `dev` branch.)

New Feature Submissions:
- Did you run the `cargo +nightly fmt --all` command prior to submission?
- Does your code pass the `cargo clippy --all --all-features` command?

Changes to Core Features: