Vsencoding for positionlist #51

Open
ShangtongZhang wants to merge 20 commits into master

Conversation

ShangtongZhang

Apply vsencoding for positionlist

BitReader rd;

/* We use an algorithm named vsencoding to encode the position list
* when the position list is used for storing document length.
Contributor

Um, I'm rather confused by this comment in particular, and what this patch is trying to do in general.

Currently, the document lengths are stored as a special post list, not a position list.

Have you just got the two mixed up, or are you actually proposing we store document lengths in the position list table instead? If so, why does that seem like a better place for them? I would have thought that we want the document lengths to be a chunked list (so we can efficiently skip, and so we can efficiently insert and delete documents from the middle), and the postlist table is set up to handle chunked lists already. Basically, document lengths are much more like a posting list than they are like a position list.
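
(For illustration, a minimal sketch of why a chunked layout makes skipping cheap. The layout and names here are hypothetical, not Xapian's actual postlist table format.)

#include <cstdint>
#include <map>
#include <string>

// Hypothetical chunked layout: each chunk of the list is keyed by the
// first docid it covers, held in an ordered (B-tree-like) table.
using ChunkTable = std::map<uint32_t, std::string>;

// skip_to(did): locate the last chunk starting at or before did, then
// decode just that one chunk.  Inserting or deleting a document in the
// middle likewise rewrites a single chunk, not the whole list.
ChunkTable::const_iterator chunk_for(const ChunkTable& table, uint32_t did) {
    auto it = table.upper_bound(did);        // first chunk strictly after did
    if (it == table.begin()) return table.end();  // did precedes every chunk
    return --it;                             // chunk whose range may hold did
}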

If vsencoding is to be used only when the position list is used to store document length (as the comment says), then why have you removed the code to handle interpolative coding of positions? If your plan is that term positions within documents should also be encoded with vsencoding instead of interpolative (and the comment here is wrong), then have you done tests to see what effect that has on space and speed performance?
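
(For reference, a sketch of textbook binary interpolative coding, the idea behind the code this patch removes. write_bits() and bits_needed() are hypothetical helpers, not Xapian's actual BitWriter API.)

#include <cstdint>
#include <vector>

// Append the low nbits bits of value to the output bitstream.
void write_bits(std::vector<bool>& out, uint32_t value, unsigned nbits) {
    for (unsigned b = nbits; b-- > 0; )
        out.push_back((value >> b) & 1);
}

// Smallest n such that 2^n distinct codes cover the values 0..range.
unsigned bits_needed(uint32_t range) {
    unsigned n = 0;
    while ((uint64_t(1) << n) <= range) ++n;
    return n;
}

// Encode the sorted positions pos[lo..hi], all known to lie in
// [min, max].  The middle element only needs enough bits to pick out
// one of its legal values, so dense runs cost very few bits.
void interpolative_encode(std::vector<bool>& out,
                          const std::vector<uint32_t>& pos,
                          int lo, int hi, uint32_t min, uint32_t max) {
    if (lo > hi) return;
    int mid = lo + (hi - lo) / 2;
    // pos[mid] must leave room for mid-lo values below and hi-mid above.
    uint32_t vmin = min + (mid - lo);
    uint32_t vmax = max - (hi - mid);
    write_bits(out, pos[mid] - vmin, bits_needed(vmax - vmin));
    interpolative_encode(out, pos, lo, mid - 1, min, pos[mid] - 1);
    interpolative_encode(out, pos, mid + 1, hi, pos[mid] + 1, max);
}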

Author

I'm sorry, I was a little confused when I wrote this comment; in fact it is completely wrong, and I have now corrected it. What I actually meant is that we use vsencoding for all position lists and discard interpolative encoding, which is why I deleted the interpolative encoding code. Here is a simple test: http://trac.xapian.org/wiki/GSoC2014/Posting%20list%20encoding%20improvements/Journal#June30
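
(To make the comparison concrete, here is a much-simplified sketch of the VSEncoding idea: partition the list into blocks, store one bit width per block, and pick the partition by dynamic programming. Block-length and width descriptors are glossed over; HEADER_BITS is an assumed per-block overhead, not a figure from the paper or from this patch.)

#include <algorithm>
#include <cstdint>
#include <vector>

// Bits needed to hold v (0 needs 0 bits).
unsigned width_of(uint32_t v) {
    unsigned n = 0;
    while (v >> n) ++n;
    return n;
}

// Return, for each prefix of vals, the minimal encoded size in bits
// over all ways of splitting it into blocks of up to MAX_BLOCK values,
// where a block costs HEADER_BITS plus its widest value's width per entry.
std::vector<size_t> vsencode_costs(const std::vector<uint32_t>& vals) {
    const size_t HEADER_BITS = 8;   // assumed width + length descriptor
    const size_t MAX_BLOCK = 16;
    std::vector<size_t> cost(vals.size() + 1, SIZE_MAX);
    cost[0] = 0;
    for (size_t i = 0; i < vals.size(); ++i) {
        unsigned w = 0;
        for (size_t len = 1; len <= MAX_BLOCK && i + len <= vals.size(); ++len) {
            w = std::max(w, width_of(vals[i + len - 1]));
            cost[i + len] = std::min(cost[i + len],
                                     cost[i] + HEADER_BITS + w * len);
        }
    }
    return cost;   // cost.back() is the minimal total size in bits
}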

Contributor

That journal entry says:

Current vsencoding decreases decoding time by 18%, at the cost of twice as much encoding time and space.

I guess we might be able to bring the encoding time down, but double the space for positional data seems to be just too much to me. In particular, in the cold cache case where we have to read a lot of positional data from disk it's the time to read the data from disk which dominates, and twice as much data means twice as much time to read it from disk. We've worked hard to bring the time down in this case, and making it take twice as long again would undo a lot of that work.

The "my original vsencoding" row shows vsencoding can be more compact, but it's still 37% larger, and at quite extreme encoding time cost, while being 41% slower to decode.

I wonder if positional data is the wrong place to be trying to use vsencoding for us, as we already have a very effective compression in use there. Perhaps posting list data would make more sense, as currently that just uses variable width integers encoding the deltas in a sorted list - a much softer target.
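
(For comparison, a minimal sketch of the kind of variable-width delta coding described above. Illustrative only; Xapian's actual packing helpers differ in detail.)

#include <cstdint>
#include <string>
#include <vector>

// Encode a sorted docid list as variable-width deltas: 7 payload bits
// per byte, high bit set on every byte except the last of each value.
std::string encode_deltas(const std::vector<uint32_t>& docids) {
    std::string out;
    uint32_t prev = 0;
    for (uint32_t id : docids) {
        uint32_t delta = id - prev;
        prev = id;
        while (delta >= 0x80) {
            out += char(0x80 | (delta & 0x7f));
            delta >>= 7;
        }
        out += char(delta);
    }
    return out;
}

// Decode back to the original sorted docid list.
std::vector<uint32_t> decode_deltas(const std::string& in) {
    std::vector<uint32_t> ids;
    uint32_t prev = 0;
    size_t i = 0;
    while (i < in.size()) {
        uint32_t delta = 0;
        unsigned shift = 0;
        unsigned char byte;
        do {
            byte = in[i++];
            delta |= uint32_t(byte & 0x7f) << shift;
            shift += 7;
        } while (byte & 0x80);
        prev += delta;
        ids.push_back(prev);
    }
    return ids;
}

Every delta here costs at least a whole byte, however small it is, which is exactly the redundancy a block-based scheme like vsencoding could exploit.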

ZhangShangtong added 2 commits July 26, 2014 09:17
ojwb closed this Dec 10, 2015
ojwb commented Jan 7, 2016

I seem to have managed to close all open PRs 28 days ago, which wasn't intended - reopening.

ojwb reopened this Jan 7, 2016