New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Adding IDF factor in LM-JM equation (#779) #230
base: master
Are you sure you want to change the base?
Conversation
…y idf factor in it.
We can't merge a patch that causes tests to fail. This one is particular suspicious:
If you look at the code, |
Also, https://travis-ci.org/xapian/xapian/jobs/519317211 reports some coding standards issue related to whitespace. |
I'm somewhat unclear on exactly what you are saying we should be changing here and especially why. The paper you attached to the ticket has highlighting (which I guess you must have added?) on:
That's the formula you've changed the code to use, except you haven't included the log(), and that makes a major difference to how this will work for queries with more than one term - log(a) + log(b) = log(a * b) so the weight contributions effectively multiply with the log(), but add without it. I'm also unclear if you're saying that the current implementation is buggy (in which case replacing it probably does make sense), or if this is a more sophisticated version of what we have (in which case it probably should be added as a new Xapian::Weight::SMOOTHING_XXX constant). If you're saying this is buggy as-is, then are you also saying the other current smoothing options are buggy too? |
Oh the So the proposed change seems to actually just be dividing each I'm not following in what way this is "adding an idf factor". |
Sorry! The infinite weight problem has been fixed. Weight contributions are now bounded above. The tests continue to fail, but for a different reason. The tests are based on the hypothesis that if smoothing is set to zero, all smoothing methods should reduce to the same formula. My current patch has only modified JM smoothing to incorporate the claimed IDF factor, but has not touched the other smoothing methods. Not surprisingly, the equality test between between modified JM and other smoothing methods (when the smoothing parameter is set to zero) fails. I thought I'd do one small change at a time, and modify the formula for the other smoothing methods once this is approved. If you would rather I make the intended change for all smoothing methods, I could do that too. |
Already taken care of (as you note below).
Indeed, yes, I feel the other smoothing formulae should also be modified (as suggested by my explanation of why the tests are failing). I think this bug is based on a mis-representation / misunderstanding of the LM weighting formulae. Using more or less self-explanatory names, the vanilla LM-JM formula (as presented in some documents, and as implemented currently in Xapian) is, roughly: P(w|D) = lambda * tf / doclen + (1-lambda) * cf / collection size I think an analogous argument will apply to the other formulae too. I can work out the details now if you like. Or I could run the 'fixed' JM formulae on test collections and check MAP / NDCG values to make sure it works better on average than the original, and then go from there. |
I don't think it is correct to multiply number_of_weighted_terms by log(param_smoothing1 * weight_collection), since the second term is different for different terms. Indeed, for single term queries, taking IDF into account will not make a difference; it only serves to give more importance to the rarer query terms, and less importance to the more common query terms. Please let me know if the explanation in the comment above makes things clearer. |
weight_sum = weight_document; | ||
else | ||
weight_sum = 1 + ((1 - param_smoothing1) / param_smoothing1) * | ||
weight_document / weight_collection; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This is still misaligned - your current indentation suggests to the reader that weight_document / weight_collection;
is a separate statement, which is misleading.
The rule for wrapping at a dyadic operator (*
in this case) is to align the start of its two operands vertically.
(That's the Xapian rule at least, though I struggle to see another sensible rule for this case).
Ah yes, I missed that After re-reading the whole of the p179-zhai paper I think I understand what you're saying here. Currently LMWeight ranks using the smoothed maximum likelihood estimate probabilities, but that fails to take the "unseen" terms into account. You didn't seem to answer my question about whether the current implementation is buggy or whether your proposed change is essentially an enhancement. My reading of the p179-zhai paper suggests "buggy" but I haven't gone back and read earlier papers.
We just can't merge a change which causes tests to fail. If the tests are testing based on an assumption which merely happens to be true for the current implementation then they can be adjusted to not assume that. If this is a property which should still be true then changing the tests wouldn't be right. I'm not sure which is the case here. I can see a certain logic to 0 meaning the same for each smoothing, but I'm not sure where that idea came from (maybe @samuelharden can remember). It does seem a bit unclean that you introduce a discontinuity in behaviour as |
Yes, I think these are buggy. We can put it as bug fixing.
Especially choice between (1 large patch) and (several smaller patches but disable the tests because we know why they're failing, and re-enable them later once all smoothing methods have been modified).
Yeah, a bit unclean, but will work correctly as intended. Do you have any suggestions for this? |
I made two suggestions above:
The first seems better (and is even probably slightly more efficient as it saves one division per termweight calculated). |
LM-JM equation doesn't have IDF factor involved in it. This commit contains LM-JM with IDF effect. I have highlighted the LM-JM with IDF factor in #779 ticket. Due to some setup issue in my machine, I am unable to run this on various data sets. Could you please review and let me know for any concerns.