Vsencoding for positionlist #51

Open
ShangtongZhang wants to merge 20 commits into master

Conversation

ShangtongZhang

Apply vsencoding for positionlist

BitReader rd;

/* We use an algorithm named vsencoding to encode the position list
* when the position list is used for storing document length.
Contributor

Um, I'm rather confused by this comment in particular, and what this patch is trying to do in general.

Currently, the document lengths are stored as a special post list, not a position list.

Have you just got the two mixed up, or are you actually proposing we store document lengths in the position list table instead? If so, why does that seem like a better place for them? I would have thought that we want the document lengths to be a chunked list (so we can efficiently skip, and so we can efficiently insert and delete documents from the middle), and the postlist table is set up to handle chunked lists already. Basically, document lengths are much more like a posting list than they are like a position list.
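
(For illustration, a minimal sketch of why a chunked layout makes skipping cheap. The layout and names here are hypothetical, not Xapian's actual postlist table format.)

#include <cstdint>
#include <map>
#include <string>

// Hypothetical chunked layout: each chunk of the list is keyed by the
// first docid it covers, held in an ordered (B-tree-like) table.
using ChunkTable = std::map<uint32_t, std::string>;

// skip_to(did): locate the last chunk starting at or before did, then
// decode just that one chunk.  Inserting or deleting a document in the
// middle likewise rewrites a single chunk, not the whole list.
ChunkTable::const_iterator chunk_for(const ChunkTable& table, uint32_t did) {
    auto it = table.upper_bound(did);        // first chunk strictly after did
    if (it == table.begin()) return table.end();  // did precedes every chunk
    return --it;                             // chunk whose range may hold did
}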

If vsencoding is to be used only when the position list is used to store document length (as the comment says), then why have you removed the code to handle interpolative coding of positions? If your plan is that term positions within documents should also be encoded with vsencoding instead of interpolative (and the comment here is wrong), then have you done tests to see what effect that has on space and speed performance?
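
(For reference, a sketch of textbook binary interpolative coding, the idea behind the code this patch removes. write_bits() and bits_needed() are hypothetical helpers, not Xapian's actual BitWriter API.)

#include <cstdint>
#include <vector>

// Append the low nbits bits of value to the output bitstream.
void write_bits(std::vector<bool>& out, uint32_t value, unsigned nbits) {
    for (unsigned b = nbits; b-- > 0; )
        out.push_back((value >> b) & 1);
}

// Smallest n such that 2^n distinct codes cover the values 0..range.
unsigned bits_needed(uint32_t range) {
    unsigned n = 0;
    while ((uint64_t(1) << n) <= range) ++n;
    return n;
}

// Encode the sorted positions pos[lo..hi], all known to lie in
// [min, max].  The middle element only needs enough bits to pick out
// one of its legal values, so dense runs cost very few bits.
void interpolative_encode(std::vector<bool>& out,
                          const std::vector<uint32_t>& pos,
                          int lo, int hi, uint32_t min, uint32_t max) {
    if (lo > hi) return;
    int mid = lo + (hi - lo) / 2;
    // pos[mid] must leave room for mid-lo values below and hi-mid above.
    uint32_t vmin = min + (mid - lo);
    uint32_t vmax = max - (hi - mid);
    write_bits(out, pos[mid] - vmin, bits_needed(vmax - vmin));
    interpolative_encode(out, pos, lo, mid - 1, min, pos[mid] - 1);
    interpolative_encode(out, pos, mid + 1, hi, pos[mid] + 1, max);
}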

Author

I'm sorry, I was a little confused when I wrote this comment; in fact it is completely wrong, and I have now corrected it. What I actually meant is that we use vsencoding for all position lists and discard interpolative encoding, which is why I deleted the interpolative encoding code. Here is a simple test: http://trac.xapian.org/wiki/GSoC2014/Posting%20list%20encoding%20improvements/Journal#June30
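
(To make the comparison concrete, here is a much-simplified sketch of the VSEncoding idea: partition the list into blocks, store one bit width per block, and pick the partition by dynamic programming. Block-length and width descriptors are glossed over; HEADER_BITS is an assumed per-block overhead, not a figure from the paper or from this patch.)

#include <algorithm>
#include <cstdint>
#include <vector>

// Bits needed to hold v (0 needs 0 bits).
unsigned width_of(uint32_t v) {
    unsigned n = 0;
    while (v >> n) ++n;
    return n;
}

// Return, for each prefix of vals, the minimal encoded size in bits
// over all ways of splitting it into blocks of up to MAX_BLOCK values,
// where a block costs HEADER_BITS plus its widest value's width per entry.
std::vector<size_t> vsencode_costs(const std::vector<uint32_t>& vals) {
    const size_t HEADER_BITS = 8;   // assumed width + length descriptor
    const size_t MAX_BLOCK = 16;
    std::vector<size_t> cost(vals.size() + 1, SIZE_MAX);
    cost[0] = 0;
    for (size_t i = 0; i < vals.size(); ++i) {
        unsigned w = 0;
        for (size_t len = 1; len <= MAX_BLOCK && i + len <= vals.size(); ++len) {
            w = std::max(w, width_of(vals[i + len - 1]));
            cost[i + len] = std::min(cost[i + len],
                                     cost[i] + HEADER_BITS + w * len);
        }
    }
    return cost;   // cost.back() is the minimal total size in bits
}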

Contributor

That journal entry says:

Current vsencoding decreases decoding time by 18%, at the cost of twice as much encoding time and space.

I guess we might be able to bring the encoding time down, but double the space for positional data seems to be just too much to me. In particular, in the cold cache case where we have to read a lot of positional data from disk it's the time to read the data from disk which dominates, and twice as much data means twice as much time to read it from disk. We've worked hard to bring the time down in this case, and making it take twice as long again would undo a lot of that work.

The "my original vsencoding" row shows vsencoding can be more compact, but it's still 37% larger, and at quite extreme encoding time cost, while being 41% slower to decode.

I wonder if positional data is the wrong place to be trying to use vsencoding for us, as we already have a very effective compression in use there. Perhaps posting list data would make more sense, as currently that just uses variable width integers encoding the deltas in a sorted list - a much softer target.
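
(For comparison, a minimal sketch of the kind of variable-width delta coding described above. Illustrative only; Xapian's actual packing helpers differ in detail.)

#include <cstdint>
#include <string>
#include <vector>

// Encode a sorted docid list as variable-width deltas: 7 payload bits
// per byte, high bit set on every byte except the last of each value.
std::string encode_deltas(const std::vector<uint32_t>& docids) {
    std::string out;
    uint32_t prev = 0;
    for (uint32_t id : docids) {
        uint32_t delta = id - prev;
        prev = id;
        while (delta >= 0x80) {
            out += char(0x80 | (delta & 0x7f));
            delta >>= 7;
        }
        out += char(delta);
    }
    return out;
}

// Decode back to the original sorted docid list.
std::vector<uint32_t> decode_deltas(const std::string& in) {
    std::vector<uint32_t> ids;
    uint32_t prev = 0;
    size_t i = 0;
    while (i < in.size()) {
        uint32_t delta = 0;
        unsigned shift = 0;
        unsigned char byte;
        do {
            byte = in[i++];
            delta |= uint32_t(byte & 0x7f) << shift;
            shift += 7;
        } while (byte & 0x80);
        prev += delta;
        ids.push_back(prev);
    }
    return ids;
}

Every delta here costs at least a whole byte, however small it is, which is exactly the redundancy a block-based scheme like vsencoding could exploit.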

ZhangShangtong added 2 commits July 26, 2014 09:17
ojwb closed this Dec 10, 2015
ojwb commented Jan 7, 2016

I seem to have managed to close all open PRs 28 days ago, which wasn't intended - reopening.

ojwb reopened this Jan 7, 2016