Block-Max WAND implementation proof of concept #812
Conversation
Super impressive work! I will need a little bit of time to digest it but we can probably land this in the next release.
Sure, take your time. I still need to understand better what needs to be done about schema and indexing, and how to approach it.
I imported your branch into the repo and started working on an integration (especially the second point).
Nice. I'll have a look.
```rust
fn block_max_score(&mut self) -> Score {
    self._score(
        self.fieldnorm_reader
            .fieldnorm_id(self.postings.block_max_doc()),
```
this is inaccurate isn't it? The max score in the block is not necessarily reached for the posting list with the highest term freq in the block I believe.
This deals with only one posting list, doesn't it? So there's only one block per term, and the only difference between postings within it is the term frequency. The function would have to be non-monotonic wrt. term frequency for this to be false. In fact, I believe the error I made was in the line below, where it should be `self.postings.block_max_term_freq()`. Or am I completely missing something here?
The fieldnorm is the number of tokens per doc. It is not a constant over the block.
To get an upper bound of the score you would have to take the min fieldnorm, and the max termfreq over the block.
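To make the point above concrete, here is a minimal sketch (not tantivy's actual API; the function names and the BM25 constants are assumptions) showing why pairing the block's max term freq with its min fieldnorm gives a valid upper bound: BM25 is monotonically increasing in term frequency and decreasing in fieldnorm, so that combination dominates every real posting in the block.

```rust
// Hypothetical sketch: `bm25` and `block_max_score` are illustrative names,
// and k1/b are the usual default constants, not values taken from tantivy.

/// Plain BM25 term score for one posting.
fn bm25(idf: f32, term_freq: f32, fieldnorm: f32, avg_fieldnorm: f32) -> f32 {
    let (k1, b) = (1.2f32, 0.75f32);
    idf * term_freq * (k1 + 1.0)
        / (term_freq + k1 * (1.0 - b + b * fieldnorm / avg_fieldnorm))
}

/// Upper bound of the BM25 score over a block: max term freq + min fieldnorm.
fn block_max_score(idf: f32, max_term_freq: u32, min_fieldnorm: u32, avg_fieldnorm: f32) -> f32 {
    bm25(idf, max_term_freq as f32, min_fieldnorm as f32, avg_fieldnorm)
}

fn main() {
    let avg = 100.0;
    // Postings in the block: (term_freq, fieldnorm). Max tf = 7, min fieldnorm = 90.
    let postings = [(3u32, 120u32), (7, 90), (2, 150)];
    let bound = block_max_score(1.5, 7, 90, avg);
    for &(tf, fieldnorm) in &postings {
        // No real posting in the block can exceed the bound.
        assert!(bm25(1.5, tf as f32, fieldnorm as f32, avg) <= bound);
    }
}
```

Note that the bound is loose: the (max tf, min fieldnorm) pair may not correspond to any actual posting, which is exactly the trade-off discussed below.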
Oh, yes, I see what you mean. The idea is to precompute that block max structure such that you don't store the max frequency but the frequency of the document with the max score.
Still, I realize that I made the following mistake: the block-max structure was supposed to be postings of this format: (last document in the block, frequency of the max-score posting within the block). This is so that you can find the block given your current document. But I also used that document ID as if it were the ID of the max-score document, which it is not.
In short, we need to store additional info. Off the top of my head, we can either (1) directly store scores instead of frequencies, or (2) store the document ID of the best posting as well.
In PISA, we do (1): We build the block max stuff per scoring function (bm25, query likelihood, etc.) and support storing it as raw uncompressed floats or compressed quantized integers.
The very easiest (but suboptimal) way of fixing this would probably be to have yet another posting list with the document IDs of the max-score postings. But this immediately increases the size of the structure and unnecessarily stores the block bound IDs twice. I'm not sure how difficult it would be to generalize the segment postings to store either floats or pairs of integers, but I can take a look sometime this week. Reading quantized scores should be easy, but that is additional complexity for the user if at any point you want to introduce other scoring functions.
Let me know what you think about all these and, as I said, I'll take a closer look soon.
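The two storage options under discussion could be sketched roughly like this (all names here are hypothetical, purely to contrast the layouts; tantivy's real structures differ):

```rust
// Hypothetical layouts for one block-max entry. Option (1) is tied to a
// particular scoring function but self-contained; option (2) stays
// score-agnostic but stores an extra doc id per block.
type DocId = u32;

enum BlockMaxEntry {
    /// (1) Store the precomputed max score of the block directly
    ///     (per scoring function, possibly quantized).
    Score { last_doc: DocId, max_score: f32 },
    /// (2) Store the doc id and term freq of the best-scoring posting,
    ///     so the score can be recomputed under any scoring function.
    BestPosting { last_doc: DocId, best_doc: DocId, term_freq: u32 },
}

impl BlockMaxEntry {
    /// Both layouts expose the block boundary used to locate a block
    /// from the current document.
    fn last_doc(&self) -> DocId {
        match *self {
            BlockMaxEntry::Score { last_doc, .. } => last_doc,
            BlockMaxEntry::BestPosting { last_doc, .. } => last_doc,
        }
    }
}

fn main() {
    let a = BlockMaxEntry::Score { last_doc: 128, max_score: 3.5 };
    let b = BlockMaxEntry::BestPosting { last_doc: 256, best_doc: 140, term_freq: 7 };
    assert_eq!(a.last_doc(), 128);
    assert_eq!(b.last_doc(), 256);
}
```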
The idea is to precompute that block max structure such that you don't store the max frequency but the frequency of the document with the max score.
That would work, but then this definition is contingent on the scoring function used... And the main benefit of storing the term freq for the max score and the fieldnorm for the doc with the max score disappears. We cannot use this with a different scoring function.
One benefit I guess, could lie in the encoding opportunity it opens... but I don't think it should be part of the trait.
I suggest we simply encode the BM25 max score as an f16 straight in the skip list, at least for the moment.
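Stable Rust has no native `f16`, so as a stand-in, here is a sketch of the one invariant any 16-bit encoding of a block max must keep: round *up* on encode, so the decoded value never underestimates the true max (otherwise WAND-style pruning could skip a matching document). The fixed-point scheme and the `GLOBAL_MAX` cap below are assumptions for illustration, not what tantivy does.

```rust
// Hypothetical 16-bit fixed-point quantization of a block-max score.
// `GLOBAL_MAX` is an assumed per-field cap on scores; `ceil` on encode keeps
// the decoded value a valid upper bound.
const GLOBAL_MAX: f32 = 64.0;

fn encode(score: f32) -> u16 {
    let scaled = (score / GLOBAL_MAX * u16::MAX as f32).ceil();
    scaled.min(u16::MAX as f32) as u16
}

fn decode(bits: u16) -> f32 {
    bits as f32 / u16::MAX as f32 * GLOBAL_MAX
}

fn main() {
    for &score in &[0.0f32, 1.337, 12.5, 63.99] {
        let decoded = decode(encode(score));
        // The round-trip never underestimates, and stays within one step.
        assert!(decoded >= score);
        assert!(decoded - score < GLOBAL_MAX / u16::MAX as f32 + 1e-3);
    }
}
```

An actual f16 encoding would follow the same rule: round toward positive infinity when narrowing the mantissa.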
I guess you're right, it will be tied to a scoring function one way or another. We can start with floats and see how it goes. We can think of improving it later.
I made a lot of changes in tantivy to make the integration with blockwand possible. Available in master; also available in a branch called …

- This is, I believe, a decent approximation. The approximation fails if the average fieldnorm estimation on the segment and the actual fieldnorm were different enough that the optimal …
- The posting list exposes some of the skip list API. In particular, …
- Some logic is already diverting the collection of the TopScore collector on a union of terms that …
- That implementation is not working properly and is not properly tested, but it shows the API.
- The collection callback and threshold are implemented using the following prototype: …
Would you like to resume your work by taking over the implementation of blockwand, i.e. working on the …
Thanks @fulmicoton. Hopefully this coming weekend I'll have some time to take a look at it and give it a try!
I haven't had as much time as I had hoped this weekend, so not much progress. But I have some questions: …
Ah right, I wanted to ask about … If it is necessary, it will probably be serialized in the skip list (or somewhere else) when the postings list is longer than 128 docs. When it is smaller, it will be computed and cached on the fly. The story is more or less the same for the last block; also, the computed-on-the-fly part has not been implemented yet.
Generally speaking, I try to avoid introducing that kind of one-shot trait. I agree they can add flexibility in the long term, or they can forcibly trim down the API to the minimum and hence help the reader... But in general, my feeling is that it is often overengineering. If there is a need for another type to implement this trait in the future, I will reintroduce the trait.

Another reason why I am not fond of that trait is that its contract is really tied to a term scorer working on a block codec. It does not feel like something that can generalize well, and properly documenting the underlying contract would require describing what the TermScorer is doing.

The main divergence with the previous API is that the block wand union is not something we iterate, advance, pass around and seek on any more. It is a function that pushes stuff to a callback. That makes the contract much simpler, and hopefully easier to test.
If such a helper does not exist yet, it is probably a good idea to add one.
We will need the actual score, or an upper bound on it. It is used for selecting the pivot. You can't rely on the block max scores in this case because you cannot know yet in which block the pivot will end up. I'm not sure if there's a way to calculate it with just block max scores; I'll think about it, but if so, it would definitely be much less efficient and require some more complex logic. I wonder if we can afford to just iterate over the skip list and take the max of block maxes. It's not ideal, but at least it's something we could use right now, say, for correctness testing if nothing else.

I hear your concerns about the API. When I get the next chance to look at the code, I'll think about what could be done to help out with testing.
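The "iterate over the skip list and take the max of block maxes" fallback is simple enough to sketch; the iterator name below is hypothetical, standing in for whatever the skip list API would expose:

```rust
// Hypothetical sketch: compute a term's global score upper bound by scanning
// all of its block-max entries once. Valid because the global max score is
// by definition the max over the per-block maxima.
fn term_max_score(block_max_scores: impl Iterator<Item = f32>) -> f32 {
    block_max_scores.fold(0.0f32, f32::max)
}

fn main() {
    let blocks = vec![1.2f32, 3.4, 2.8, 0.9];
    let bound = term_max_score(blocks.into_iter());
    assert!((bound - 3.4).abs() < 1e-6);
}
```

Linear in the number of blocks, but paid once per term per query, which is plausibly acceptable for correctness testing.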
I don't understand this part of the algorithm actually. Maybe you can help me? Since it assumes the lists are pointing to a specific doc and are sorted, can't we take the actual current doc score to compute the pivot?
Take a look at the second paragraph in section 3, "Our Contribution", aided by Figure 4. Technically, we still could use the block max scores, but not directly (I think). Say we are about to check whether the current document in the list for dog in the figure is the pivot. Then we first need to align the skip list above to make sure we're in the block that covers our document. Now, say this isn't the pivot. Then we go to monkey, and now we need to align both cat and dog. I think this would be correct, and possibly give a lower upper bound than the original. But this might be costly to compute. I'm not sure if this has ever been done as a follow-up; quite possibly, but I'm not sure. If it had been successful, I would probably know about it, but I can bring it up with my advisor when I talk to him next time.

Either way, I think for now I want to simply compute the full max score at the beginning of the algorithm from the block maxes: it doesn't change the algorithm, and I want to start with something that works correctly and is tested. Then we should run a benchmark and note the results. And then we can take it from there.
Ok, so I was slightly wrong describing the working solution: when aligning the skip lists for the previous terms, we need to actually keep the max score between the starting document (where we started finding the pivot) and the current document considered as pivot. This is important, but also in no way more complicated to implement. The question of performance is still open, though.
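For reference, the pivot-selection step being discussed can be sketched as follows. This is the standard WAND pivot rule using per-term upper bounds; the struct and function names are illustrative, not tantivy's API:

```rust
// Hypothetical sketch of WAND-style pivot selection. Lists are assumed
// sorted by their current doc id; the pivot is the first position at which
// the accumulated score upper bounds exceed the collector's threshold.
type DocId = u32;

struct TermState {
    current_doc: DocId,
    max_score: f32, // global (or windowed) score upper bound for this term
}

/// Returns the index of the pivot list, or None if no remaining doc can
/// possibly beat the threshold.
fn find_pivot(terms: &[TermState], threshold: f32) -> Option<usize> {
    let mut acc = 0.0f32;
    for (i, term) in terms.iter().enumerate() {
        acc += term.max_score;
        if acc > threshold {
            return Some(i);
        }
    }
    None
}

fn main() {
    let terms = vec![
        TermState { current_doc: 3, max_score: 1.0 },
        TermState { current_doc: 7, max_score: 2.0 },
        TermState { current_doc: 9, max_score: 4.0 },
    ];
    assert_eq!(find_pivot(&terms, 0.5), Some(0)); // first term alone suffices
    assert_eq!(find_pivot(&terms, 3.5), Some(2)); // 1+2 <= 3.5 < 1+2+4
    assert_eq!(find_pivot(&terms, 10.0), None); // nothing can beat 10
    let pivot = find_pivot(&terms, 3.5).unwrap();
    assert_eq!(terms[pivot].current_doc, 9);
}
```

Block-max WAND then tightens `max_score` with the block-level bounds around the pivot candidate, as covered in the thread above.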
Ok, I think I understood! Thank you! I'll implement something to get the max_score. It might be a tad tricky, because I do not have a perfect value for the fieldnorm at index time. If I apply the same approximation at the scale of the posting list, it is less likely to be correct, so I will need to do something else. (PISA just ignores dynamic indexing?)
Lucene seems to be simply using the upper bound coming from BM25. I wonder if this is good enough an approximation. |
PISA doesn't have dynamic indexing, correct. You have to rebuild the index if you want to update it.
I'm not entirely sure what you mean by "coming from BM25".
BM25 is very different from TF-IDF in that the BM25 score for a term for a given doc is bounded. So even if I don't know anything about fieldnorms and term frequencies, I can give a semi-decent upper bound of the BM25 score of any doc.
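This bound falls straight out of the formula: the term-frequency factor tf·(k1+1)/(tf + k1·norm) approaches (k1+1) as tf grows, so idf·(k1+1) bounds the score without knowing any fieldnorm or term frequency. A minimal sketch (the constants and function names are assumptions for illustration):

```rust
// BM25's saturating term-frequency component; `norm` is the fieldnorm
// factor (1 - b + b * fieldnorm / avg_fieldnorm), always positive.
fn bm25_tf_factor(tf: f32, k1: f32, norm: f32) -> f32 {
    tf * (k1 + 1.0) / (tf + k1 * norm)
}

/// Upper bound on the BM25 score of a term, valid for any doc:
/// the tf factor can never reach (k1 + 1), so idf * (k1 + 1) dominates.
fn bm25_upper_bound(idf: f32, k1: f32) -> f32 {
    idf * (k1 + 1.0)
}

fn main() {
    let (idf, k1) = (2.0f32, 1.2f32);
    let bound = bm25_upper_bound(idf, k1);
    // Even extreme term frequencies and tiny norms stay below the bound.
    for &tf in &[1.0f32, 10.0, 1_000.0, 1_000_000.0] {
        for &norm in &[0.25f32, 1.0, 4.0] {
            assert!(idf * bm25_tf_factor(tf, k1, norm) <= bound);
        }
    }
}
```

This is why such a bound exists for BM25 but not for plain TF-IDF, whose score grows without limit in the term frequency.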
Oh, I see. I'll have to think this over when I get a chance.
@elshize I just pushed an implementation of block max score for the last block, and the BM25 upper bound for TermScorer.
Hey, I just wanted to let you know I didn't forget about this, but I'm currently extremely busy. Hopefully I'll find some time for this next month.
@elshize Closing this PR. The other one is now merged in master and published in 0.13.
@fulmicoton Thanks, I'm looking forward to checking it out once my life slows down a little.
@elshize COVID is terrible. Take it ez. Tantivy has a lot of room for your contributions :)
This is a proof-of-concept implementation of the BMW algorithm for union queries. There are still some parts missing in the algorithm itself, and I also haven't touched anything related to indexing yet.
Two things in particular I want to point out for now:

- `Box` …: thinking that it should be faster to move than the lists themselves; I'm not yet sure this is the best way, as I haven't spent too much time thinking about it so far;
- …

Any comments at this stage would be appreciated.