The DFR PL2 and DPH Weighting Scheme #8

Closed
wants to merge 31 commits into
from

@aarshkshah1992

I have added the collection frequency statistic to the Xapian::Weight class so that subclasses can access it, and I created a new map to hold this statistic. I've also tried my best to keep the documentation neat, based on the feedback I've received previously.

The tests pass on all backends except the remote ones, because the collection frequency has not been added to the TermFreqs map.

I have added simple tests for the scheme. I will add feature tests after working on the feedback.

The log is base 2 everywhere: I've decided to follow Amati's paper on DFR schemes rather than his thesis. Base 2 makes more sense because, with c = 1, the normalized wdf of a document of average length is then equal to its wdf.
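The point about base 2 can be checked with a small sketch. It assumes the usual DFR "normalisation 2" form, wdfn = wdf * log2(1 + c * average_length / doc_length); the function and parameter names are illustrative and are not taken from the patch.

```cpp
#include <cmath>

// Normalised wdf under DFR normalisation 2 with a base-2 logarithm.
// Assumption: wdfn = wdf * log2(1 + c * average_length / doc_length).
// All names here are hypothetical, chosen only for this illustration.
double normalised_wdf(double wdf, double c,
                      double average_length, double doc_length) {
    return wdf * std::log2(1.0 + c * average_length / doc_length);
}
```

With c = 1 and doc_length equal to average_length the logarithm is log2(2) = 1, so the function returns wdf unchanged; longer documents get a smaller value and shorter documents a larger one.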

Future DFR schemes will be easier to implement after I learn from the feedback of this scheme.

@aarshkshah1992

The formula for P before expanding it is P = -lg((e^(-lambda) * lambda^wdf) / wdf!), where lambda is a constant and is always < 1.

1.) Thus, the greater the wdf, the smaller lambda^wdf is.

2.) wdf! also increases as wdf increases, and thus the component inside the log decreases.

3.) -log increases as the component inside the log decreases, and thus P is directly proportional to wdf.

ojwb replied May 31, 2013

Hmm, "directly proportional" isn't the right term (that would be like P = lambda * wdf - http://en.wikipedia.org/wiki/Proportionality_%28mathematics%29#Direct_proportionality) but I know what you mean - increasing wdf increases P, and decreasing wdf decreases P. I don't recall what that's actually called.
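The relationship discussed in this thread can be checked numerically. The sketch below evaluates P = -lg((e^(-lambda) * lambda^wdf) / wdf!) through logarithms to avoid overflowing the factorial; the function name is hypothetical and the code is only an illustration, not the implementation from the patch.

```cpp
#include <cmath>

// Information content of the Poisson term, in bits:
//   P = -log2( e^(-lambda) * lambda^wdf / wdf! )
// Computed in log space; std::lgamma(wdf + 1) equals ln(wdf!).
// The name poisson_information is illustrative only.
double poisson_information(unsigned wdf, double lambda) {
    double ln_p = -lambda + wdf * std::log(lambda) - std::lgamma(wdf + 1.0);
    return -ln_p / std::log(2.0);
}
```

For lambda < 1, each step from wdf to wdf + 1 adds (ln(wdf + 1) - ln(lambda)) / ln 2, which is positive, so P rises with wdf. It is not a direct proportion, though: P is already positive at wdf = 0, and doubling wdf does not double P.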

@aarshkshah1992

A zero value of c is not allowed and so I've removed this test.

@aarshkshah1992

Sorry for damaging the existing entries.

@ojwb ojwb commented on the diff Mar 23, 2013
xapian-core/include/xapian/weight.h
@@ -45,7 +45,8 @@ class XAPIAN_VISIBILITY_DEFAULT Weight {
DOC_LENGTH = 256,
DOC_LENGTH_MIN = 512,
DOC_LENGTH_MAX = 1024,
- WDF_MAX = 2048
+ WDF_MAX = 2048,
+ COLLEC_FREQ=4096
@ojwb
ojwb Mar 23, 2013 Contributor

For consistency with COLLECTION_SIZE and Database::get_collection_freq(), I think COLLECTION_FREQ is better.

@ojwb ojwb commented on the diff Mar 23, 2013
xapian-core/include/xapian/weight.h
@@ -285,6 +289,9 @@ class XAPIAN_VISIBILITY_DEFAULT Weight {
/// The number of documents which this term indexes.
Xapian::doccount get_termfreq() const { return termfreq_; }
+ /// The total number of times this term occurs in the collection.
+ Xapian::termcount get_collec_freq() const { return collec_freq_; }
@ojwb
ojwb Mar 23, 2013 Contributor

And get_collection_freq() here, to match Database::get_collection_freq().

@ojwb ojwb commented on the diff Mar 23, 2013
xapian-core/weight/weightinternal.h
@@ -74,6 +74,9 @@ class Weight::Internal {
* collection. */
std::map<std::string, TermFreqs> termfreqs;
+ /** Map of collection frequencies for the collection. */
+ std::map<std::string, Xapian::termcount> collec_freqs;
@ojwb
ojwb Mar 23, 2013 Contributor

I think the collection frequency may need to go into the TermFreqs class rather than having a separate map for it - as things stand I'm not sure this is going to work properly with the remote backend or OP_SYNONYM.
I haven't tried to test this yet though. Do your new testcases pass with the remote backend?
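One possible reading of this suggestion: if the collection frequency sits in the same per-term record as the term frequency, any code path that already copies or merges that record carries the new statistic along, with no parallel map to keep in sync. A minimal sketch, using hypothetical names rather than Xapian's actual TermFreqs definition:

```cpp
#include <map>
#include <string>

// Hypothetical per-term statistics record; the type and member names
// are illustrative and not taken from Xapian's source.
struct TermStats {
    unsigned termfreq;         // number of documents the term indexes
    unsigned collection_freq;  // total occurrences of the term

    TermStats(unsigned tf = 0, unsigned cf = 0)
        : termfreq(tf), collection_freq(cf) {}

    // Combine statistics gathered from two sub-databases.
    TermStats& operator+=(const TermStats& other) {
        termfreq += other.termfreq;
        collection_freq += other.collection_freq;
        return *this;
    }
};

typedef std::map<std::string, TermStats> StatsMap;

// Fold the statistics of one sub-database into the running totals.
// Terms not yet present in `total` start from a zeroed record.
void merge_stats(StatsMap& total, const StatsMap& shard) {
    for (StatsMap::const_iterator it = shard.begin(); it != shard.end(); ++it) {
        total[it->first] += it->second;
    }
}
```

Because both counters travel in one record, a single merge step keeps them consistent per term. Whether this matches how the remote backend actually transfers statistics is an assumption here.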

aarsh kiran ... added some commits Mar 24, 2013
aarsh kiran mansukhlal kalyanji nanchand shah Added Xapian::DFR_PL2Weight to the registry. 398d2e4
aarsh kiran mansukhlal kalyanji nanchand shah Added DFR_PL2Weight to the makefile in Csharp bindings. ec121d9
aarsh kiran mansukhlal kalyanji nanchand shah Added DFR_PL2Weight to the makefile for java bindings. a66ae64
aarsh kiran mansukhlal kalyanji nanchand shah Test for conversion of invalid parameter to default value. c5995cd
aarsh kiran mansukhlal kalyanji nanchand shah Added test for dfr_pl2weight. 5edce85
aarsh kiran mansukhlal kalyanji nanchand shah Added feature tests for DFR_PL2Weight. d2d0cf2
aarsh kiran mansukhlal kalyanji nanchand shah Removed trailing whitespaces from api_weight.cc. 4a0eb21
aarsh kiran mansukhlal kalyanji nanchand shah Removed trailing whitespaces from dfr_pl2weight.cc d9ed67e
aarsh kiran mansukhlal kalyanji nanchand shah Removed trailing whitespaces from weight.h. 874920c
aarsh kiran mansukhlal kalyanji nanchand shah Added one more feature test for DFR_PL2Weight. e130842
aarsh kiran mansukhlal kalyanji nanchand shah Test file for the DFR_PL2 weighting scheme. 780b673
aarsh kiran mansukhlal kalyanji nanchand shah Placed <cmath> after "xapian/weight.h". fc3b87b
@aarshkshah1992

I will update the method to obtain collection frequency in the code and the tests once Gaurav's code gets merged.

aarsh kiran ... added some commits Mar 27, 2013
aarsh kiran mansukhlal kalyanji nanchand shah Corrected error and used DFR_PL2 weighting scheme in the tests for PL2. b97824b
aarsh kiran mansukhlal kalyanji nanchand shah Removed apitest_declen PL2 test as F not very less than N. 896f3ef
aarsh kiran mansukhlal kalyanji nanchand shah Added DFR_DPHWeight to weight.h and corrected indentation in the PL2 …
…class.
a445480
aarsh kiran mansukhlal kalyanji nanchand shah Added DFR_DPHWeight to the registry. 14c0c55
aarsh kiran mansukhlal kalyanji nanchand shah dfr_dphweight.cc contains implementation of DFR_DPHWeight. 2d67d42
aarsh kiran mansukhlal kalyanji nanchand shah Added dfr_dphweight.cc to the Makefile. eaa2c4e
aarsh kiran mansukhlal kalyanji nanchand shah Added DFR_DPHWeight to the csharp makefile. 6cbf600
aarsh kiran mansukhlal kalyanji nanchand shah Added DFR_DPHWeight to the java bindings makefile. 47a6656
aarsh kiran mansukhlal kalyanji nanchand shah Simple test for DFR_DPHWeight. 403d10f
aarsh kiran mansukhlal kalyanji nanchand shah Feature tests for the dph weighting scheme. 0004565
@ojwb
Contributor
ojwb commented May 31, 2013

I think we should probably just drop the DFR_ prefix on the weight class names - there isn't really any ambiguity, and the longer name is more cumbersome. We don't have Okapi_BM25Weight or SMART_TfIdfWeight either.

And we also don't have any other API class names with a _ in currently. If we did need to discriminate, I think a namespace would be a better way - e.g. Xapian::DFR::PL2Weight. But there isn't another PL2 weighting scheme, and it seems unlikely someone would now use that name for a different scheme.

If you want to automate an update like this, you can use git grep to find the files that feature the pattern you want, and then sed -i to do an in-place modification to them:

git grep -l '\<DFR_' | xargs sed -i 's/\<DFR_//g'

Then you can check the changes made were what you wanted with git diff and if not, revert them with git checkout . (assuming you started with a clean working tree) and try again.

@aarshkshah1992

Yes, I understand. Will update the pull request.

@ojwb
Contributor
ojwb commented Jun 11, 2013

Can you rebase this onto latest master (which has the collection frequency changes merged in)?

@aarshkshah1992 aarshkshah1992 deleted the aarshkshah1992:dfr_weight branch Jun 13, 2013