
Adding IDF factor in LM-JM equation (#779) #230

Open

wants to merge 2 commits into master

Conversation

souravsaha commented Apr 12, 2019

The LM-JM equation doesn't have an IDF factor involved in it. This commit contains LM-JM with the IDF effect. I have highlighted LM-JM with the IDF factor in ticket #779. Due to a setup issue on my machine, I am unable to run this on various data sets. Could you please review and let me know of any concerns.

ojwb (Contributor) commented Apr 14, 2019

We can't merge a patch that causes tests to fail.

This one is particularly suspicious:

api_weight.cc:1474: TEST_EQUAL_DOUBLE_((mset2[i].get_weight()), (mset4[i].get_weight()))
Expected 'mset2[i].get_weight()' and 'mset4[i].get_weight()' to be (nearly) equal: were inf and 6.57128304236092)

If you look at the code, mset2 is produced using Xapian::Weight::JELINEK_MERCER_SMOOTHING, and this is showing that a document is getting an infinite weight. Weight contributions need to be bounded above, so that can't be valid.

ojwb (Contributor) commented Apr 14, 2019

Also, https://travis-ci.org/xapian/xapian/jobs/519317211 reports some coding standards issues related to whitespace.

ojwb (Contributor) commented Apr 15, 2019

I'm somewhat unclear on exactly what you are saying we should be changing here and especially why.

The paper you attached to the ticket has highlighting (which I guess you must have added?) on:

The term weight is log(1 + (1 − λ) p_ml(q_i | d) / (λ p(q_i | C)))

That's the formula you've changed the code to use, except you haven't included the log(), and that makes a major difference to how this will work for queries with more than one term - log(a) + log(b) = log(a * b) so the weight contributions effectively multiply with the log(), but add without it.
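(Spelled out: with the log, the query score is sum_i log(1 + x_i) = log(prod_i (1 + x_i)), where x_i = (1 − λ) p_ml(q_i | d) / (λ p(q_i | C)), so the per-term factors multiply; without the log the x_i simply add.)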

I'm also unclear if you're saying that the current implementation is buggy (in which case replacing it probably does make sense), or if this is a more sophisticated version of what we have (in which case it probably should be added as a new Xapian::Weight::SMOOTHING_XXX constant).

If you're saying this is buggy as-is, then are you also saying the other current smoothing options are buggy too? DIRICHLET_SMOOTHING and ABSOLUTE_DISCOUNT_SMOOTHING both seem to also match the equations in table 1 of the paper, which I think you're trying to say are the wrong equations to be using.
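For reference, the standard forms of those three smoothing methods (with lambda, mu and delta the respective smoothing parameters, c(w;d) the count of w in d, |d| the document length, and |d|_u the number of unique terms in d) are roughly:

Jelinek-Mercer:        p(w|d) = (1 - lambda) * p_ml(w|d) + lambda * p(w|C)
Dirichlet:             p(w|d) = (c(w;d) + mu * p(w|C)) / (|d| + mu)
Absolute discounting:  p(w|d) = max(c(w;d) - delta, 0) / |d| + delta * |d|_u / |d| * p(w|C)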

ojwb (Contributor) commented Apr 15, 2019

Oh the log() is already done lower down - sorry, I didn't spot that.

So the proposed change seems to actually just be dividing each weight_sum by param_smoothing1 * weight_collection. Assuming param_smoothing1 isn't zero, param_smoothing1 * weight_collection is constant for a given query on a given database, so taking the log() into account that will just decrease the total weight by number_of_weighted_terms * log(param_smoothing1 * weight_collection) won't it?

I'm not following in what way this is "adding an idf factor".

souravsaha (Author) commented

We can't merge a patch that causes tests to fail.

This one is particularly suspicious:

api_weight.cc:1474: TEST_EQUAL_DOUBLE_((mset2[i].get_weight()), (mset4[i].get_weight()))
Expected 'mset2[i].get_weight()' and 'mset4[i].get_weight()' to be (nearly) equal: were inf and 6.57128304236092)

If you look at the code, mset2 is produced using Xapian::Weight::JELINEK_MERCER_SMOOTHING, and this is showing that a document is getting an infinite weight. Weight contributions need to be bounded above, so that can't be valid.


Sorry! The infinite weight problem has been fixed. Weight contributions are now bounded above. The tests continue to fail, but for a different reason. The tests are based on the hypothesis that if smoothing is set to zero, all smoothing methods should reduce to the same formula. My current patch has only modified JM smoothing to incorporate the claimed IDF factor, but has not touched the other smoothing methods. Not surprisingly, the equality test between modified JM and other smoothing methods (when the smoothing parameter is set to zero) fails.
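(At least in the paper's formulation, setting the smoothing parameter to zero collapses all three methods to the plain maximum-likelihood estimate tf / doclen, which is presumably where that hypothesis comes from.)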

I thought I'd do one small change at a time, and modify the formula for the other smoothing methods once this is approved. If you would rather I make the intended change for all smoothing methods, I could do that too.

souravsaha (Author) commented Apr 21, 2019

I'm somewhat unclear on exactly what you are saying we should be changing here and especially why.

The paper you attached to the ticket has highlighting (which I guess you must have added?) on:

The term weight is log(1 + (1 − λ) p_ml(q_i | d) / (λ p(q_i | C)))

That's the formula you've changed the code to use, except you haven't included the log(), and that makes a major difference to how this will work for queries with more than one term - log(a) + log(b) = log(a * b) so the weight contributions effectively multiply with the log(), but add without it.

Already taken care of (as you note below).

I'm also unclear if you're saying that the current implementation is buggy (in which case replacing it probably does make sense), or if this is a more sophisticated version of what we have (in which case it probably should be added as a new Xapian::Weight::SMOOTHING_XXX constant).

If you're saying this is buggy as-is, then are you also saying the other current smoothing options are buggy too? DIRICHLET_SMOOTHING and ABSOLUTE_DISCOUNT_SMOOTHING both seem to also match the equations in table 1 of the paper, which I think you're trying to say are the wrong equations to be using.

Indeed, yes, I feel the other smoothing formulae should also be modified (as suggested by my explanation of why the tests are failing). I think this bug is based on a misrepresentation / misunderstanding of the LM weighting formulae.

Using more or less self-explanatory names, the vanilla LM-JM formula (as presented in some documents, and as implemented currently in Xapian) is, roughly:

P(w|D) = lambda * tf / doclen + (1 - lambda) * cf / collection size

Consider two terms w1 (say politician) and w2 (say brexit) in a single document D, both having the same tf. I expect w1, the more general term, will have a higher total count across all documents in the collection ('cf' in the formula above) than w2. Thus, the weight of w1 in D will be higher than that of w2 in D. This is contrary to the expectation that the rarer / higher-IDF term's weight should be higher (tf remaining the same).
To fix this, in the modified formula, I've divided throughout by ((1-lambda) * cf / collection size) which is equivalent to multiplying by an IDF-like factor.
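As a toy example with made-up numbers, take lambda = 0.5, doclen = 100, tf = 5 for both terms, collection size = 1,000,000, cf(politician) = 10,000 and cf(brexit) = 100:

vanilla:   P(politician|D) = 0.5 * 0.05 + 0.5 * 0.01   = 0.03
           P(brexit|D)     = 0.5 * 0.05 + 0.5 * 0.0001 = 0.02505    (common term scores higher)

modified:  politician: 0.03    / (0.5 * 0.01)   = 6
           brexit:     0.02505 / (0.5 * 0.0001) = 501               (rare term scores higher)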

I think an analogous argument will apply to the other formulae too. I can work out the details now if you like. Or I could run the 'fixed' JM formulae on test collections and check MAP / NDCG values to make sure it works better on average than the original, and then go from there.

souravsaha (Author) commented

Oh the log() is already done lower down - sorry, I didn't spot that.

So the proposed change seems to actually just be dividing each weight_sum by param_smoothing1 * weight_collection. Assuming param_smoothing1 isn't zero, param_smoothing1 * weight_collection is constant for a given query on a given database, so taking the log() into account that will just decrease the total weight by number_of_weighted_terms * log(param_smoothing1 * weight_collection) won't it?

I'm not following in what way this is "adding an idf factor".

I don't think it is correct to multiply number_of_weighted_terms by log(param_smoothing1 * weight_collection), since weight_collection is different for different terms. Indeed, for single-term queries, taking IDF into account will not make a difference; it only serves to give more importance to the rarer query terms, and less importance to the more common query terms.
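Spelling it out against the description you quote above: the change replaces each term's weight_sum by weight_sum / (param_smoothing1 * weight_collection), so after the log the total score drops by sum_i log(param_smoothing1 * weight_collection_i). That penalty is larger for terms with a larger weight_collection_i (i.e. more common terms), which is exactly the IDF-like effect, and it is not a per-query constant precisely because weight_collection_i varies from term to term.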

Please let me know if the explanation in the comment above makes things clearer.

weight_sum = weight_document;
else
weight_sum = 1 + ((1 - param_smoothing1) / param_smoothing1) *
weight_document / weight_collection;
Contributor review comment on the code above:

This is still misaligned - your current indentation suggests to the reader that weight_document / weight_collection; is a separate statement, which is misleading.

The rule for wrapping at a dyadic operator (* in this case) is to align the start of its two operands vertically.

(That's the Xapian rule at least, though I struggle to see another sensible rule for this case).
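For illustration (ignoring the indentation of the enclosing block), that would be something like:

weight_sum = 1 + ((1 - param_smoothing1) / param_smoothing1) *
                 weight_document / weight_collection;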

ojwb (Contributor) commented May 3, 2019

Ah yes, I missed that weight_collection does vary by term.

After re-reading the whole of the p179-zhai paper I think I understand what you're saying here. Currently LMWeight ranks using the smoothed maximum likelihood estimate probabilities, but that fails to take the "unseen" terms into account.
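The key rewrite in that paper is roughly, for a query q = q_1 ... q_n:

log p(q|d) = sum over seen q_i of log( p_s(q_i|d) / (alpha_d * p(q_i|C)) )
             + n * log(alpha_d)
             + sum over all q_i of log p(q_i|C)

where p_s is the smoothed estimate for terms seen in d and alpha_d is the coefficient on the collection model for unseen terms. The last sum is document-independent, and for Jelinek-Mercer alpha_d = lambda, which is where the log(1 + (1 − λ) p_ml(q_i | d) / (λ p(q_i | C))) term highlighted in the ticket comes from.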

You didn't seem to answer my question about whether the current implementation is buggy or whether your proposed change is essentially an enhancement. My reading of the p179-zhai paper suggests "buggy" but I haven't gone back and read earlier papers.

Sorry! The infinite weight problem has been fixed. Weight contributions are now bounded above. The tests continue to fail, but for a different reason. The tests are based on the hypothesis that if smoothing is set to zero, all smoothing methods should reduce to the same formula. My current patch has only modified JM smoothing to incorporate the claimed IDF factor, but has not touched the other smoothing methods. Not surprisingly, the equality test between modified JM and other smoothing methods (when the smoothing parameter is set to zero) fails.

We just can't merge a change which causes tests to fail.

If the tests are testing based on an assumption which merely happens to be true for the current implementation then they can be adjusted to not assume that. If this is a property which should still be true then changing the tests wouldn't be right. I'm not sure which is the case here. I can see a certain logic to 0 meaning the same for each smoothing, but I'm not sure where that idea came from (maybe @samuelharden can remember).

It does seem a bit unclean that you introduce a discontinuity in behaviour as param_smoothing1 reaches zero - otherwise as it varies the effect varies smoothly. I wonder if not dividing through by param_smoothing1 would work better. Or requiring param_smoothing1 to be non-zero perhaps.
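(On the discontinuity, concretely: with the patched formula, weight_sum = 1 + ((1 - param_smoothing1) / param_smoothing1) * weight_document / weight_collection grows without bound as param_smoothing1 tends to zero, while at exactly zero the code presumably falls back to weight_sum = weight_document, hence the jump.)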

souravsaha (Author) commented

Ah yes, I missed that weight_collection does vary by term.

After re-reading the whole of the p179-zhai paper I think I understand what you're saying here. Currently LMWeight ranks using the smoothed maximum likelihood estimate probabilities, but that fails to take the "unseen" terms into account.

You didn't seem to answer my question about whether the current implementation is buggy or whether your proposed change is essentially an enhancement. My reading of the p179-zhai paper suggests "buggy" but I haven't gone back and read earlier papers.

Yes, I think these are buggy. We can treat this as a bug fix.

Sorry! The infinite weight problem has been fixed. Weight contributions are now bounded above. The tests continue to fail, but for a different reason. The tests are based on the hypothesis that if smoothing is set to zero, all smoothing methods should reduce to the same formula. My current patch has only modified JM smoothing to incorporate the claimed IDF factor, but has not touched the other smoothing methods. Not surprisingly, the equality test between modified JM and other smoothing methods (when the smoothing parameter is set to zero) fails.

We just can't merge a change which causes tests to fail.

If the tests are testing based on an assumption which merely happens to be true for the current implementation then they can be adjusted to not assume that. If this is a property which should still be true then changing the tests wouldn't be right. I'm not sure which is the case here. I can see a certain logic to 0 meaning the same for each smoothing, but I'm not sure where that idea came from (maybe @samuelharden can remember).

I'd especially like your opinion on the choice between (a) one large patch covering all the smoothing methods, and (b) several smaller patches, temporarily disabling the failing tests (since we know why they're failing) and re-enabling them once all the smoothing methods have been modified.

It does seem a bit unclean that you introduce a discontinuity in behaviour as param_smoothing1 reaches zero - otherwise as it varies the effect varies smoothly. I wonder if not dividing through by param_smoothing1 would work better. Or requiring param_smoothing1 to be non-zero perhaps.

Yeah, it's a bit unclean, but it will work correctly as intended. Do you have any suggestions for this?

ojwb (Contributor) commented May 3, 2019

It does seem a bit unclean that you introduce a discontinuity in behaviour as param_smoothing1 reaches zero - otherwise as it varies the effect varies smoothly. I wonder if not dividing through by param_smoothing1 would work better. Or requiring param_smoothing1 to be non-zero perhaps.

Yeah, it's a bit unclean, but it will work correctly as intended. Do you have any suggestions for this?

I made two suggestions above:

  • not dividing through by param_smoothing1 (this would give the same ranking order for non-zero param_smoothing1 just with weights scaled by param_smoothing1, and when param_smoothing1 == 0 it reduces to weight_document / weight_collection which is also the limit as we tend to zero), or
  • requiring param_smoothing1 to be non-zero

The first seems better (and is probably even slightly more efficient, as it saves one division per termweight calculated).
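Concretely, the first option would amount to something like this in the JM branch (an untested sketch):

weight_sum = param_smoothing1 +
             (1 - param_smoothing1) * weight_document / weight_collection;

which needs no special case for param_smoothing1 == 0 and tends smoothly to weight_document / weight_collection there.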
