
Block-Max WAND implementation proof of concept #812

Closed · wants to merge 4 commits

Conversation

@elshize commented Apr 17, 2020

This is a proof-of-concept implementation of the BMW algorithm for union queries. Some parts of the algorithm itself are still missing, and I haven't touched anything related to indexing yet.

Two things in particular I want to point out for now:

  • BMW requires sorting the posting lists; I wrapped them in Box on the assumption that moving a box should be faster than moving the lists themselves. I'm not yet sure this is the best way; I haven't spent much time thinking about it so far (see the sketch after this list).
  • BMW needs an additional piece of feedback from the collector: the current threshold. Right now I am passing a predicate, but that definitely has its disadvantages. One alternative is to have someone, say the collector, set it between steps; again, I'm not entirely sure what the right way to do this is.
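
For the first point, here is a minimal sketch of the sorting step. `PostingList` is a stand-in for the real posting-list type (only its current-doc accessor matters here); boxing means the sort swaps pointers rather than the lists themselves:

```rust
// Hypothetical stand-in for the real posting-list type.
struct PostingList {
    cursor: usize,
    docs: Vec<u32>,
}

impl PostingList {
    /// Current document of the cursor.
    fn doc(&self) -> u32 {
        self.docs[self.cursor]
    }
}

/// BMW keeps the lists ordered by their current document id.
/// Sorting `Box<PostingList>` moves pointers, not the lists themselves.
fn sort_by_current_doc(lists: &mut [Box<PostingList>]) {
    lists.sort_by_key(|list| list.doc());
}
```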

Any comments at this stage would be appreciated.

@fulmicoton (Collaborator)

Super impressive work! I will need a little bit of time to digest it but we can probably land this in the next release.

@elshize commented Apr 19, 2020

Sure, take your time. I still need to understand better what needs to be done about schema and indexing, and how to approach it.

@fulmicoton (Collaborator)

I imported your branch into the repo and started working on an integration (especially the second point):
https://github.com/tantivy-search/tantivy/compare/elshize-bmw?expand=1

@elshize commented May 4, 2020

Nice. I'll have a look.

```rust
fn block_max_score(&mut self) -> Score {
    self._score(
        self.fieldnorm_reader
            .fieldnorm_id(self.postings.block_max_doc()),
```
@fulmicoton (Collaborator):

This is inaccurate, isn't it? The max score in the block is not necessarily reached by the posting with the highest term freq in the block, I believe.

@elshize (Author):

This deals with only one posting list, doesn't it? So there's only one block for one term, and the only difference between the postings is the term frequency. The function would have to be non-monotonic w.r.t. term frequency for this to be false. In fact, I believe the error I made is in the line below, which should be `self.postings.block_max_term_freq()`. Or am I completely missing something here?

@fulmicoton (Collaborator):

The fieldnorm is the number of tokens per doc. It is not a constant over the block.

To get an upper bound of the score you would have to take the min fieldnorm, and the max termfreq over the block.
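
A minimal sketch of that bound, with `bm25` standing in for the actual scoring code (`idf` and `avg_fieldnorm` are assumed to be precomputed; K1 and B are the usual BM25 parameters):

```rust
const K1: f32 = 1.2;
const B: f32 = 0.75;

// Stand-in BM25 term score: non-decreasing in term_freq,
// non-increasing in fieldnorm.
fn bm25(term_freq: u32, fieldnorm: u32, avg_fieldnorm: f32, idf: f32) -> f32 {
    let tf = term_freq as f32;
    let norm = K1 * (1.0 - B + B * fieldnorm as f32 / avg_fieldnorm);
    idf * tf * (K1 + 1.0) / (tf + norm)
}

// Pairing the block's max term_freq with its min fieldnorm therefore
// yields a valid (if loose) upper bound for every doc in the block.
fn block_score_upper_bound(
    block_max_term_freq: u32,
    block_min_fieldnorm: u32,
    avg_fieldnorm: f32,
    idf: f32,
) -> f32 {
    bm25(block_max_term_freq, block_min_fieldnorm, avg_fieldnorm, idf)
}
```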

@elshize (Author):

Oh, yes, I see what you mean. The idea is to precompute that block max structure such that you don't store the max frequency but the frequency of the document with the max score.

Still, I realize that I made the following mistake: the block-max structure was supposed to be postings of this format: (last document in the block, frequency of the max-score posting within the block). This is so that you can find the block given your current document. But I also used that document ID as if it were the ID of the max-score document, which it is not.

In short, we need to store additional info. Off the top of my head, we can either (1) directly store scores instead of frequencies, or (2) store the document ID of the best posting as well.

In PISA, we do (1): we build the block-max structures per scoring function (BM25, query likelihood, etc.) and support storing them as raw uncompressed floats or as compressed quantized integers.

The very easiest (but suboptimal) way of fixing this would probably be to have yet another posting list with the document IDs of the max-score postings. But this immediately increases the size of the structure and unnecessarily stores the block-bound IDs twice. I'm not sure how difficult it would be to generalize the segment postings to store either floats or pairs of integers, but I can take a look sometime this week. Reading quantized scores should be easy, but that is additional complexity for the user if at any point you want to introduce other scoring functions. A sketch of the two options is below.
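
A minimal sketch of the two storage options, with hypothetical field names; either way, `last_doc` is what lets you locate the block for a given document:

```rust
// Hypothetical per-block metadata illustrating options (1) and (2).
enum BlockMaxEntry {
    // (1) store the block's score bound directly; raw f32 here,
    // though PISA also supports compressed quantized integers.
    Score { last_doc: u32, max_score: f32 },
    // (2) keep the term frequency, plus the id of the posting that
    // achieves the max score, so it can be rescored under any
    // scoring function.
    MaxScorePosting { last_doc: u32, max_score_doc: u32, term_freq: u32 },
}
```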

Let me know what you think about all these and, as I said, I'll take a closer look soon.

@fulmicoton (Collaborator):

> The idea is to precompute that block max structure such that you don't store the max frequency but the frequency of the document with the max score.

That would work, but then this definition is contingent on the scoring function used... and the main benefit of storing the term freq and the fieldnorm of the doc with the max score disappears: we cannot use this with a different scoring function.

One benefit, I guess, could lie in the encoding opportunities it opens... but I don't think it should be part of the trait.

I suggest we simply encode the BM25 max score as an f16 straight in the skip list... at least for the moment.
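
A minimal sketch of that encoding, assuming the `half` crate for the f16 conversion (the actual code may roll its own):

```rust
use half::f16;

// `from_f32` rounds to nearest, so to keep the stored value a valid
// *upper* bound we bump it to the next representable f16 whenever
// rounding went down. Assumes a non-negative, finite score (true for BM25).
fn encode_block_max(score: f32) -> u16 {
    let mut encoded = f16::from_f32(score);
    if encoded.to_f32() < score {
        encoded = f16::from_bits(encoded.to_bits() + 1);
    }
    encoded.to_bits()
}

fn decode_block_max(bits: u16) -> f32 {
    f16::from_bits(bits).to_f32()
}
```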

@elshize (Author):

I guess you're right, it will be tied to a scoring function one way or another. We can start with floats and see how it goes. We can think of improving it later.

@fulmicoton (Collaborator)

@elshize

I made a lot of changes in tantivy to make the integration with blockwand possible.
Let me explain the details here; it might be a bit long.

Available in master:
The API for DocSets has changed. Just like in PISA, DocSets are now positioned on their first doc when they are created. After all docs have been consumed, they are located on a sentinel DocId called TERMINATED.
`.skip` is now called `.seek`, and it does not necessarily advance the DocSet.
It should simplify a lot of the juggling you were doing.
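
A minimal loop over the new API as described above (exact signatures are taken from the description and may not match the code exactly):

```rust
// The DocSet is already positioned on its first doc when created;
// `advance()` eventually lands on the TERMINATED sentinel.
fn for_each_doc(docset: &mut dyn DocSet, f: &mut impl FnMut(DocId)) {
    let mut doc = docset.doc();
    while doc != TERMINATED {
        f(doc);
        doc = docset.advance();
    }
}
```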

Available in a branch called blockwand:
BlockWand information is encoded in the skip list. It is made available in the form of the term_freq and the fieldnorm_id of the document that achieves the best BM25 score (the parameters are not configurable yet), assuming the average fieldnorm is the same as the segment's.
At search time, we apply blockwand and compute the block max score using the correct fieldnorm.

This is, I believe, a decent approximation. The approximation fails if the segment's average fieldnorm estimate and the actual fieldnorm differ enough that the optimal (fieldnorm_id, term_freq) pair in the block is different.
Right now the last block (the incomplete one) returns +infty. I think this might hurt performance very badly for low doc_freq terms, so I will change the code and have it compute the block max score for this block only.

The posting list exposes some of the skip list API. In particular, `shallow_advance` is now available.

Some logic already diverts collection by the TopScore collector, on a union of terms that have block wand information, to a special block wand implementation in block_wand.rs.

That implementation is not working properly and is not properly tested, but it shows the API, and it shows that implementing a DocSet is no longer required.

The collection callback and threshold are implemented using the following prototype:

```rust
pub fn block_wand(
    mut scorers: Vec<TermScorer>,
    mut threshold: f32,
    callback: &mut dyn FnMut(u32, Score) -> Score,
)
```
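
For illustration, a minimal caller following that prototype: the callback receives each (doc, score) that beats the current threshold and returns the updated threshold, here the k-th best score so far (everything except `block_wand` itself is illustrative):

```rust
fn collect_top_k(scorers: Vec<TermScorer>, k: usize) -> Vec<(u32, Score)> {
    let mut top: Vec<(u32, Score)> = Vec::new();
    block_wand(scorers, 0.0, &mut |doc, score| {
        top.push((doc, score));
        // Keep only the k best; fine for a sketch, a heap would be
        // the obvious improvement.
        top.sort_by(|a, b| b.1.partial_cmp(&a.1).unwrap());
        top.truncate(k);
        // New threshold: the k-th best score, or 0.0 until k docs are seen.
        if top.len() == k { top[k - 1].1 } else { 0.0 }
    });
    top
}
```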

Would you like to resume your work by taking over the implementation of blockwand, i.e. working on the blockwand branch and implementing the block_wand function in block_wand.rs? (You can trash everything in that file except the block_wand function prototype.)
I am, however, not sure your code is easily usable considering the changes in the API.

@elshize commented Jun 15, 2020

Thanks @fulmicoton. Hopefully, this coming weekend I'll have some time to take a look at it, and give it a try!

@elshize commented Jun 21, 2020

I haven't had as much time as I had hoped this weekend, so not much progress. But I have some questions:

  • `TermScorer::max_score` seems to always return the max value; how do you plan to implement that logic?
  • The problem with the current API is that it's quite difficult to unit test: you have to define the fieldnorm_reader, for instance. I would personally opt for defining a trait that extends both Scorer/DocSet and Postings (or at least exposes a postings() method). The logic of BMW is nuanced as it is, and adding score computations on top makes it difficult to write sensible tests; it's much easier when you just have (doc, score) values to deal with. That said, if there were an easy way to create a TermScorer in a test, I could probably work around that by reading those values first, but I didn't find how to do that (I know there's a way to create postings, but I'm not sure about the fieldnorm_reader). Is there a specific reason why you want to have a concrete type in the API here?

@fulmicoton (Collaborator)

> `TermScorer::max_score` seems to always return the max value; how do you plan to implement that logic?

Ah right, I wanted to ask about TermScorer::max_score(). Is it useful for the implementation or can we remove it? I don't think I saw it referenced in the block wand paper. I don't have any good plan for implementing it yet. As you may know, tantivy only supports BM25 at the moment. Is the asymptotic score for BM25 good enough for what you want to use it for? (IDF × (k1 + 1))

If it is necessary, it will probably be serialized in the skip list (or somewhere else) when the postings list is longer than 128 docs. When it is smaller, it will be computed and cached on the fly.

The story is more or less the same as for the last block; also, the computed-on-the-fly part has not been implemented yet.
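
For reference, a minimal sketch of where that asymptotic bound comes from:

```rust
// In BM25 the term-frequency factor tf * (k1 + 1) / (tf + norm)
// increases toward (k1 + 1) as tf grows (norm >= 0), so
// idf * (k1 + 1) bounds the term score for any tf and any fieldnorm.
fn bm25_asymptotic_bound(idf: f32, k1: f32) -> f32 {
    idf * (k1 + 1.0)
}
```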

> The problem with the current API is that it's quite difficult to unit test: you have to define the fieldnorm_reader, for instance. I would personally opt for defining a trait that extends both Scorer/DocSet and Postings (or at least exposes a postings() method). The logic of BMW is nuanced as it is, and adding score computations on top makes it difficult to write sensible tests; it's much easier when you just have (doc, score) values to deal with. That said, if there were an easy way to create a TermScorer in a test, I could probably work around that by reading those values first, but I didn't find how to do that (I know there's a way to create postings, but I'm not sure about the fieldnorm_reader).
> Is there a specific reason why you want to have a concrete type in the API here?

Generally speaking, I try to avoid introducing that kind of one-shot trait. I agree it can add flexibility in the long term, or forcibly trim the API down to the minimum and hence help the reader... but in general, my feeling is that it is often over-engineering. If there is a need for another type to implement this trait in the future, I will reintroduce it. Another reason why I am not fond of that trait is that its contract is really tied to a term scorer working on a block codec. It does not feel like something that can generalize well, and properly documenting the underlying contract would require describing what the TermScorer is doing.

The main divergence from the previous API is that the block wand union is no longer something we iterate, advance, pass around, and seek on. It is a function that pushes results to a callback. That makes the contract much simpler and, hopefully, easier to test.

> That said, if there were an easy way to create a TermScorer in a test, I could probably work around that by reading those values first, but I didn't find how to do that (I know there's a way to create postings, but I'm not sure about the fieldnorm_reader).

If such a helper does not exist yet, it is probably a good idea to add one.
What do you think would be the right API for that?
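
One possible shape for such a helper (purely hypothetical, not an existing tantivy API): build a `TermScorer` for tests straight from (doc, term_freq) pairs plus per-doc fieldnorms, hiding the fieldnorm_reader plumbing:

```rust
#[cfg(test)]
impl TermScorer {
    /// Hypothetical test constructor: builds in-memory postings and a
    /// fieldnorm reader from plain slices.
    fn for_tests(postings: &[(DocId, u32)], fieldnorms: &[u32]) -> TermScorer {
        unimplemented!("wire up in-memory postings and fieldnorms")
    }
}
```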

@elshize commented Jun 22, 2020

> Ah right, I wanted to ask about TermScorer::max_score(). Is it useful for the implementation or can we remove it? I don't think I saw it referenced in the block wand paper. I don't have any good plan for implementing it yet. As you may know, tantivy only supports BM25 at the moment. Is the asymptotic score for BM25 good enough for what you want to use it for? (IDF × (k1 + 1))

We will need the actual score, or an upper bound of it. It is used for selecting the pivot (see the sketch below): you can't rely on the block max scores in this case, because you cannot yet know in which block the pivot will end up. I'm not sure if there's a way to calculate it with just the block max scores; I'll think about it, but if so, it would definitely be much less efficient and require more complex logic.
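
For context, a minimal sketch of pivot selection in the WAND family, which is where this whole-list bound gets used (`max_score()` is the bound under discussion; the rest is illustrative):

```rust
// Assumes `scorers` is sorted by current doc. Accumulate per-term
// upper bounds until they exceed the threshold; the scorer where that
// happens holds the pivot. If the total never exceeds the threshold,
// no remaining document can make it into the top-k.
fn find_pivot(scorers: &[TermScorer], threshold: Score) -> Option<usize> {
    let mut upper_bound = 0.0;
    for (idx, scorer) in scorers.iter().enumerate() {
        upper_bound += scorer.max_score();
        if upper_bound > threshold {
            return Some(idx);
        }
    }
    None
}
```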

I wonder if we can afford to just iterate over the skip list and get the max of block maxes. It's not ideal, but at least it's something we could use right now, say, for correctness testing if nothing else.
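
A minimal sketch of that fallback; the skip-list cursor API shown here (`shallow_seek`, `block_max_score`, `block_last_doc`) is illustrative, not tantivy's actual surface:

```rust
// Walk the block-level skip list once at query start, never decoding
// postings, and keep the max of the block maxes.
fn max_of_block_maxes(scorer: &mut TermScorer) -> Score {
    let mut global_max = 0.0;
    loop {
        global_max = global_max.max(scorer.block_max_score());
        let last_doc = scorer.block_last_doc();
        if last_doc == TERMINATED {
            break; // hypothetical: the final block reports TERMINATED
        }
        scorer.shallow_seek(last_doc + 1); // move only the block cursor
    }
    global_max
}
```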

I hear your concerns about the API. When I get the next chance to look at the code, I'll think about what could be done to help out with testing.

@fulmicoton (Collaborator) commented Jun 22, 2020

I don't understand this part of the algorithm, actually. Maybe you can help me?
I'm reading the original paper:
http://engineering.nyu.edu/~suel/papers/bmw.pdf

Since it assumes the lists are pointing to a specific doc and are sorted, can't we take the actual current doc score to compute the pivot?
If not, isn't the block max score OK too?

@elshize commented Jun 22, 2020

Take a look at the second paragraph in section 3 "Our Contribution", aided by Figure 4.

Technically, we could still use the block max scores, but not directly (I think). Say we are about to check whether the current document in the list for dog in the figure is the pivot. Then we first need to align the skip list above it to make sure we're in the block that covers our document. Now, say this isn't the pivot. Then we go to monkey, and now we need to align both cat and dog. I think this would be correct, and possibly give a lower upper bound than the original. But it might be costly to compute.

I'm not sure whether this has ever been done as a follow-up; quite possibly, but I'm not sure. If it had been successful, I would probably know about it, but I can bring it up with my advisor the next time I talk to him.

Either way, I think for now I want to simply compute the full max score at the beginning of the algorithm from the block maxes: it doesn't change the algorithm, and I want to start with something that works correctly and is tested. Then we should run a benchmark and note the results. And then we can take it from there.

@elshize commented Jun 22, 2020

Ok, so I was slightly wrong in describing the working solution: when aligning the skip lists for the previous terms, we actually need to keep the max score between the starting document (where we start finding the pivot) and the current document considered as the pivot. This is important, but in no way more complicated to implement. The question of performance is still open, though.

@fulmicoton (Collaborator)

Ok I think I understood! Thank you! I'll implement something to get the max_score.

It might be a tad tricky, because I do not have a perfect value for the fieldnorm at index time.
Lucene fixes this problem by encoding all optimal (fieldnorm_id, tf) pairs. I decided to just store the optimal one for the given average fieldnorm estimate, hoping the estimate is close enough to be correct at the scale of the block.

If I apply the same approximation at the scale of the whole posting list, it is less likely to be correct, so I will need to do something else.

(PISA just ignores dynamic indexing?)

@fulmicoton (Collaborator)

Lucene seems to simply use the upper bound coming from BM25. I wonder if this is a good enough approximation.

@elshize commented Jun 23, 2020

PISA doesn't have dynamic indexing, correct. You have to rebuild the index if you want to update it.

> Lucene seems to simply use the upper bound coming from BM25. I wonder if this is a good enough approximation.

I'm not entirely sure what you mean by "coming from BM25".

@fulmicoton (Collaborator) commented Jun 23, 2020 via email

@elshize commented Jun 23, 2020

Oh, I see. I'll have to think this over when I get a chance.

@fulmicoton (Collaborator)

@elshize I just pushed an implementation of the block max score for the last block, and the BM25 upper bound for TermScorer.

@elshize commented Jul 11, 2020

Hey, I just wanted to let you know I didn't forget about this but I'm currently extremely busy. Hopefully, I'll find some time for this next month.

@fulmicoton (Collaborator)

@elshize Closing this PR. The other one is now merged into master and published in 0.13.

fulmicoton closed this on Aug 19, 2020
@elshize commented Aug 20, 2020

@fulmicoton Thanks, I'm looking forward to checking it out once my life slows down a little.

@fulmicoton (Collaborator)

@elshize COVID is terrible. Take it ez. Tantivy has a lot of room for your contributions :)
