
Adding IDF factor in LM-JM equation (#779) #230

Open

wants to merge 2 commits into master

Conversation

souravsaha commented Apr 12, 2019

The LM-JM equation doesn't have an IDF factor involved in it. This commit contains LM-JM with the IDF effect. I have highlighted LM-JM with the IDF factor in ticket #779. Due to a setup issue on my machine, I am unable to run this on various data sets. Could you please review and let me know of any concerns.

ojwb (Contributor) commented Apr 14, 2019

We can't merge a patch that causes tests to fail.

This one is particularly suspicious:

api_weight.cc:1474: TEST_EQUAL_DOUBLE_((mset2[i].get_weight()), (mset4[i].get_weight()))
Expected 'mset2[i].get_weight()' and 'mset4[i].get_weight()' to be (nearly) equal: were inf and 6.57128304236092)

If you look at the code, mset2 is produced using Xapian::Weight::JELINEK_MERCER_SMOOTHING, and this is showing that a document is getting an infinite weight. Weight contributions need to be bounded above, so that can't be valid.

ojwb (Contributor) commented Apr 14, 2019

Also, https://travis-ci.org/xapian/xapian/jobs/519317211 reports some coding standards issues related to whitespace.

ojwb (Contributor) commented Apr 15, 2019

I'm somewhat unclear on exactly what you are saying we should be changing here and especially why.

The paper you attached to the ticket has highlighting (which I guess you must have added?) on:

The term weight is log(1 + (1 − λ) p_ml(q_i | d) / (λ p(q_i | C)))

That's the formula you've changed the code to use, except you haven't included the log(), and that makes a major difference to how this will work for queries with more than one term - log(a) + log(b) = log(a * b) so the weight contributions effectively multiply with the log(), but add without it.
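(Spelled out: with the log, the query score is sum_i log(1 + x_i) = log(prod_i (1 + x_i)), where x_i = (1 − λ) p_ml(q_i | d) / (λ p(q_i | C)), so the per-term factors multiply; without the log the x_i simply add.)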

I'm also unclear if you're saying that the current implementation is buggy (in which case replacing it probably does make sense), or if this is a more sophisticated version of what we have (in which case it probably should be added as a new Xapian::Weight::SMOOTHING_XXX constant).

If you're saying this is buggy as-is, then are you also saying the other current smoothing options are buggy too? DIRICHLET_SMOOTHING and ABSOLUTE_DISCOUNT_SMOOTHING both seem to also match the equations in table 1 of the paper, which I think you're trying to say are the wrong equations to be using.
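For reference, the standard forms of those three smoothing methods (with lambda, mu and delta the respective smoothing parameters, c(w;d) the count of w in d, |d| the document length, and |d|_u the number of unique terms in d) are roughly:

Jelinek-Mercer:        p(w|d) = (1 - lambda) * p_ml(w|d) + lambda * p(w|C)
Dirichlet:             p(w|d) = (c(w;d) + mu * p(w|C)) / (|d| + mu)
Absolute discounting:  p(w|d) = max(c(w;d) - delta, 0) / |d| + delta * |d|_u / |d| * p(w|C)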

ojwb (Contributor) commented Apr 15, 2019

Oh the log() is already done lower down - sorry, I didn't spot that.

So the proposed change seems to actually just be dividing each weight_sum by param_smoothing1 * weight_collection. Assuming param_smoothing1 isn't zero, param_smoothing1 * weight_collection is constant for a given query on a given database, so taking the log() into account that will just decrease the total weight by number_of_weighted_terms * log(param_smoothing1 * weight_collection) won't it?

I'm not following in what way this is "adding an idf factor".

souravsaha (Author) commented

We can't merge a patch that causes tests to fail.

This one is particularly suspicious:

api_weight.cc:1474: TEST_EQUAL_DOUBLE_((mset2[i].get_weight()), (mset4[i].get_weight()))
Expected 'mset2[i].get_weight()' and 'mset4[i].get_weight()' to be (nearly) equal: were inf and 6.57128304236092)

If you look at the code, mset2 is produced using Xapian::Weight::JELINEK_MERCER_SMOOTHING, and this is showing that a document is getting an infinite weight. Weight contributions need to be bounded above, so that can't be valid.


Sorry! The infinite weight problem has been fixed. Weight contributions are now bounded above. The tests continue to fail, but for a different reason. The tests are based on the hypothesis that if smoothing is set to zero, all smoothing methods should reduce to the same formula. My current patch has only modified JM smoothing to incorporate the claimed IDF factor, but has not touched the other smoothing methods. Not surprisingly, the equality test between modified JM and other smoothing methods (when the smoothing parameter is set to zero) fails.
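(At least in the paper's formulation, setting the smoothing parameter to zero collapses all three methods to the plain maximum-likelihood estimate tf / doclen, which is presumably where that hypothesis comes from.)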

I thought I'd do one small change at a time, and modify the formula for the other smoothing methods once this is approved. If you would rather I make the intended change for all smoothing methods, I could do that too.

souravsaha (Author) commented Apr 21, 2019

I'm somewhat unclear on exactly what you are saying we should be changing here and especially why.

The paper you attached to the ticket has highlighting (which I guess you must have added?) on:

The term weight is log(1 + (1 − λ) p_ml(q_i | d) / (λ p(q_i | C)))

That's the formula you've changed the code to use, except you haven't included the log(), and that makes a major difference to how this will work for queries with more than one term - log(a) + log(b) = log(a * b) so the weight contributions effectively multiply with the log(), but add without it.

Already taken care of (as you note below).

I'm also unclear if you're saying that the current implementation is buggy (in which case replacing it probably does make sense), or if this is a more sophisticated version of what we have (in which case it probably should be added as a new Xapian::Weight::SMOOTHING_XXX constant).

If you're saying this is buggy as-is, then are you also saying the other current smoothing options are buggy too? DIRICHLET_SMOOTHING and ABSOLUTE_DISCOUNT_SMOOTHING both seem to also match the equations in table 1 of the paper, which I think you're trying to say are the wrong equations to be using.

Indeed, yes, I feel the other smoothing formulae should also be modified (as suggested by my explanation of why the tests are failing). I think this bug is based on a misrepresentation / misunderstanding of the LM weighting formulae.

Using more or less self-explanatory names, the vanilla LM-JM formula (as presented in some documents, and as implemented currently in Xapian) is, roughly:

P(w|D) = lambda * tf / doclen + (1 - lambda) * cf / collection size

Consider two terms w1 (say politician) and w2 (say brexit) in a single document D, both having the same tf. I expect w1, the more general term, will have a higher total count across all documents in the collection ('cf' in the formula above) than w2. Thus, the weight of w1 in D will be higher than that of w2 in D. This is contrary to the expectation that the rarer / higher-IDF term's weight should be higher (tf remaining the same).
To fix this, in the modified formula, I've divided throughout by ((1-lambda) * cf / collection size) which is equivalent to multiplying by an IDF-like factor.
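As a toy example with made-up numbers, take lambda = 0.5, doclen = 100, tf = 5 for both terms, collection size = 1,000,000, cf(politician) = 10,000 and cf(brexit) = 100:

vanilla:   P(politician|D) = 0.5 * 0.05 + 0.5 * 0.01   = 0.03
           P(brexit|D)     = 0.5 * 0.05 + 0.5 * 0.0001 = 0.02505    (common term scores higher)

modified:  politician: 0.03    / (0.5 * 0.01)   = 6
           brexit:     0.02505 / (0.5 * 0.0001) = 501               (rare term scores higher)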

I think an analogous argument will apply to the other formulae too. I can work out the details now if you like. Or I could run the 'fixed' JM formulae on test collections and check MAP / NDCG values to make sure it works better on average than the original, and then go from there.

souravsaha (Author) commented

Oh the log() is already done lower down - sorry, I didn't spot that.

So the proposed change seems to actually just be dividing each weight_sum by param_smoothing1 * weight_collection. Assuming param_smoothing1 isn't zero, param_smoothing1 * weight_collection is constant for a given query on a given database, so taking the log() into account that will just decrease the total weight by number_of_weighted_terms * log(param_smoothing1 * weight_collection) won't it?

I'm not following in what way this is "adding an idf factor".

I don't think it is correct to multiply number_of_weighted_terms by log(param_smoothing1 * weight_collection), since weight_collection is different for different terms. Indeed, for single-term queries, taking IDF into account will not make a difference; it only serves to give more importance to the rarer query terms, and less importance to the more common query terms.
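Spelling it out against the description you quote above: the change replaces each term's weight_sum by weight_sum / (param_smoothing1 * weight_collection), so after the log the total score drops by sum_i log(param_smoothing1 * weight_collection_i). That penalty is larger for terms with a larger weight_collection_i (i.e. more common terms), which is exactly the IDF-like effect, and it is not a per-query constant precisely because weight_collection_i varies from term to term.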

Please let me know if the explanation in the comment above makes things clearer.

weight_sum = weight_document;
else
weight_sum = 1 + ((1 - param_smoothing1) / param_smoothing1) *
weight_document / weight_collection;
Contributor review comment on the code above:

This is still misaligned - your current indentation suggests to the reader that weight_document / weight_collection; is a separate statement, which is misleading.

The rule for wrapping at a dyadic operator (* in this case) is to align the start of its two operands vertically.

(That's the Xapian rule at least, though I struggle to see another sensible rule for this case).
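For illustration (ignoring the indentation of the enclosing block), that would be something like:

weight_sum = 1 + ((1 - param_smoothing1) / param_smoothing1) *
                 weight_document / weight_collection;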

ojwb (Contributor) commented May 3, 2019

Ah yes, I missed that weight_collection does vary by term.

After re-reading the whole of the p179-zhai paper I think I understand what you're saying here. Currently LMWeight ranks using the smoothed maximum likelihood estimate probabilities, but that fails to take the "unseen" terms into account.
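The key rewrite in that paper is roughly, for a query q = q_1 ... q_n:

log p(q|d) = sum over seen q_i of log( p_s(q_i|d) / (alpha_d * p(q_i|C)) )
             + n * log(alpha_d)
             + sum over all q_i of log p(q_i|C)

where p_s is the smoothed estimate for terms seen in d and alpha_d is the coefficient on the collection model for unseen terms. The last sum is document-independent, and for Jelinek-Mercer alpha_d = lambda, which is where the log(1 + (1 − λ) p_ml(q_i | d) / (λ p(q_i | C))) term highlighted in the ticket comes from.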

You didn't seem to answer my question about whether the current implementation is buggy or whether your proposed change is essentially an enhancement. My reading of the p179-zhai paper suggests "buggy" but I haven't gone back and read earlier papers.

Sorry! The infinite weight problem has been fixed. Weight contributions are now bounded above. The tests continue to fail, but for a different reason. The tests are based on the hypothesis that if smoothing is set to zero, all smoothing methods should reduce to the same formula. My current patch has only modified JM smoothing to incorporate the claimed IDF factor, but has not touched the other smoothing methods. Not surprisingly, the equality test between modified JM and other smoothing methods (when the smoothing parameter is set to zero) fails.

We just can't merge a change which causes tests to fail.

If the tests are testing based on an assumption which merely happens to be true for the current implementation then they can be adjusted to not assume that. If this is a property which should still be true then changing the tests wouldn't be right. I'm not sure which is the case here. I can see a certain logic to 0 meaning the same for each smoothing, but I'm not sure where that idea came from (maybe @samuelharden can remember).

It does seem a bit unclean that you introduce a discontinuity in behaviour as param_smoothing1 reaches zero - otherwise as it varies the effect varies smoothly. I wonder if not dividing through by param_smoothing1 would work better. Or requiring param_smoothing1 to be non-zero perhaps.
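(On the discontinuity, concretely: with the patched formula, weight_sum = 1 + ((1 - param_smoothing1) / param_smoothing1) * weight_document / weight_collection grows without bound as param_smoothing1 tends to zero, while at exactly zero the code presumably falls back to weight_sum = weight_document, hence the jump.)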

souravsaha (Author) commented

Ah yes, I missed that weight_collection does vary by term.

After re-reading the whole of the p179-zhai paper I think I understand what you're saying here. Currently LMWeight ranks using the smoothed maximum likelihood estimate probabilities, but that fails to take the "unseen" terms into account.

You didn't seem to answer my question about whether the current implementation is buggy or whether your proposed change is essentially an enhancement. My reading of the p179-zhai paper suggests "buggy" but I haven't gone back and read earlier papers.

Yes, I think these are buggy. We can treat this as a bug fix.

Sorry! The infinite weight problem has been fixed. Weight contributions are now bounded above. The tests continue to fail, but for a different reason. The tests are based on the hypothesis that if smoothing is set to zero, all smoothing methods should reduce to the same formula. My current patch has only modified JM smoothing to incorporate the claimed IDF factor, but has not touched the other smoothing methods. Not surprisingly, the equality test between modified JM and other smoothing methods (when the smoothing parameter is set to zero) fails.

We just can't merge a change which causes tests to fail.

If the tests are testing based on an assumption which merely happens to be true for the current implementation then they can be adjusted to not assume that. If this is a property which should still be true then changing the tests wouldn't be right. I'm not sure which is the case here. I can see a certain logic to 0 meaning the same for each smoothing, but I'm not sure where that idea came from (maybe @samuelharden can remember).

I'd especially like your opinion on the choice between (a) one large patch covering all the smoothing methods, and (b) several smaller patches, temporarily disabling the failing tests (since we know why they're failing) and re-enabling them once all the smoothing methods have been modified.

It does seem a bit unclean that you introduce a discontinuity in behaviour as param_smoothing1 reaches zero - otherwise as it varies the effect varies smoothly. I wonder if not dividing through by param_smoothing1 would work better. Or requiring param_smoothing1 to be non-zero perhaps.

Yeah, it's a bit unclean, but it will work correctly as intended. Do you have any suggestions for this?

ojwb (Contributor) commented May 3, 2019

It does seem a bit unclean that you introduce a discontinuity in behaviour as param_smoothing1 reaches zero - otherwise as it varies the effect varies smoothly. I wonder if not dividing through by param_smoothing1 would work better. Or requiring param_smoothing1 to be non-zero perhaps.

Yeah, it's a bit unclean, but it will work correctly as intended. Do you have any suggestions for this?

I made two suggestions above:

  • not dividing through by param_smoothing1 (this would give the same ranking order for non-zero param_smoothing1 just with weights scaled by param_smoothing1, and when param_smoothing1 == 0 it reduces to weight_document / weight_collection which is also the limit as we tend to zero), or
  • requiring param_smoothing1 to be non-zero

The first seems better (and is probably even slightly more efficient, as it saves one division per termweight calculated).
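Concretely, the first option would amount to something like this in the JM branch (an untested sketch):

weight_sum = param_smoothing1 +
             (1 - param_smoothing1) * weight_document / weight_collection;

which needs no special case for param_smoothing1 == 0 and tends smoothly to weight_document / weight_collection there.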
