docid deltas while indexing #2249

PSeitz · 2023-11-12T14:10:45Z

Storing deltas is especially helpful for repetitive data like logs.
In those cases, recording a doc on a term used to cost 4 bytes; it now
costs 1 byte.

HDFS indexing (1.1 GB), total memory consumption:
Before: 760 MB
Now: 590 MB
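
To illustrate where the 4-byte versus 1-byte difference comes from, here is a minimal sketch of variable-length integer encoding applied to absolute doc ids and to deltas. It is not tantivy's actual recorder code: the function names `write_vint`, `encode_absolute` and `encode_deltas` are hypothetical, and the byte layout (7 payload bits per byte, continuation bit on all but the last byte) is an assumption chosen for illustration.

```rust
/// Append `value` as a variable-length integer: 7 payload bits per byte,
/// high bit set on every byte except the last.
/// (Illustrative layout only, not necessarily the one tantivy uses.)
fn write_vint(mut value: u32, out: &mut Vec<u8>) {
    while value >= 0x80 {
        out.push((value as u8 & 0x7f) | 0x80);
        value >>= 7;
    }
    out.push(value as u8);
}

/// Encode each doc id as-is.
fn encode_absolute(doc_ids: &[u32]) -> Vec<u8> {
    let mut out = Vec::new();
    for &doc in doc_ids {
        write_vint(doc, &mut out);
    }
    out
}

/// Encode each doc id as the gap to the previous one.
/// Assumes `doc_ids` is sorted in increasing order.
fn encode_deltas(doc_ids: &[u32]) -> Vec<u8> {
    let mut out = Vec::new();
    let mut prev_doc = 0u32;
    for &doc in doc_ids {
        write_vint(doc - prev_doc, &mut out);
        prev_doc = doc;
    }
    out
}
```

For 100 consecutive doc ids starting at 3,000,000 (a term that occurs in every document, as is common in logs), the absolute encoding takes 4 bytes per doc, while the delta encoding takes 4 bytes for the first doc and 1 byte for each of the remaining 99.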

adamreichold · 2023-11-12T15:55:37Z

src/postings/recorder.rs

-                VInt32Reader::new(&buffer[..])
-                    .map(|old_doc_id| doc_id_map.get_new_doc_id(old_doc_id)),
-            );
+            doc_ids.extend(VInt32Reader::new(&buffer[..]).map(|delta_doc_id| {


The use of prev_doc here suggests that scan instead of map could make this more obvious.

fulmicoton · 2023-11-13T00:34:54Z

src/postings/recorder.rs

-                VInt32Reader::new(&buffer[..])
-                    .map(|old_doc_id| doc_id_map.get_new_doc_id(old_doc_id)),
-            );
+            doc_ids.extend(VInt32Reader::new(&buffer[..]).map(|delta_doc_id| {


+1 in favor of scan or a loop.

Having prev_doc outside of the if-statement is not a good idea. Having it in both branches makes the code easier to (proof)read.

Switched to scan, but we can't apply the same technique in the other cases, because they contain a mix of delta-encoded and non-delta-encoded values.
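
For readers unfamiliar with the suggestion, the following is a minimal sketch of delta decoding with scan, where the running doc id lives in the scan state instead of a variable captured from outside the closure. It is not the code from the PR: `decode_deltas` is a hypothetical helper, and a plain iterator of u32 stands in for the VInt32Reader seen in the diff.

```rust
/// Turn a stream of doc-id deltas back into absolute doc ids.
/// The running total is kept in the `scan` state (`prev_doc`),
/// so no mutable variable outside the closure is needed.
fn decode_deltas(deltas: impl Iterator<Item = u32>) -> impl Iterator<Item = u32> {
    deltas.scan(0u32, |prev_doc, delta| {
        *prev_doc += delta;
        Some(*prev_doc)
    })
}
```

Calling `decode_deltas(vec![5, 1, 1, 10].into_iter())` yields the doc ids 5, 6, 7, 17. This only works when every value in the stream is a delta, which is why it does not carry over to buffers that mix delta and non-delta values.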

docid deltas while indexing

f6512d4

PSeitz force-pushed the index_delta_docids branch from c11ddec to f6512d4 on November 12, 2023 15:05

adamreichold reviewed Nov 12, 2023

fulmicoton reviewed Nov 13, 2023

fulmicoton approved these changes Nov 13, 2023

use scan for delta decoding

0ed6a9e

PSeitz merged commit b60d862 into main Nov 13, 2023
4 checks passed

PSeitz deleted the index_delta_docids branch November 13, 2023 04:14
