The TfIdf Weighting scheme #6

Closed
wants to merge 21 commits into
from

Conversation

Projects
None yet
2 participants

The code for tfidfweight.cc does not contain LOGCAL as Ill add it and all relevant documentation for this scheme after working on the feedback.Have implemented all normalizations possible with the current statistics . The code is written in a way that is easy to add more normalizations after rewriting the backend for additional statistics.Have also tried to provide as extensive test coverage as I could .Please do let me know if any changes/additions/deletions are required.

xapian-core/include/xapian/weight.h
@@ -347,6 +347,120 @@ class XAPIAN_VISIBILITY_DEFAULT BoolWeight : public Weight {
double get_maxextra() const;
};
+/* Xapian::Weight subclass implementing the tf-idf weighting scheme with various
+ normalizations for tf ,idf and the tfidf weight . */
@ojwb

ojwb Mar 7, 2013

Contributor

The comments before each class and method are specially formatted, so that they can be automatically extracted by doxygen, and turned into the API documentation you can see on the website, e.g. at http://xapian.org/docs/apidoc/html/classXapian_1_1BM25Weight.html

Doxygen needs a comment to start /** or /// or else it assumes it's just a normal comment, not a documentation comment.

The first line should be a short description - if there's much more to say than fits on a line, write a short summary, put a paragraph break, and then go into more detail. Look at existing classes and methods for examples.

Something like Xapian::Weight subclass implementing the tf-idf weighting schemes. would be a good short description here.

Also, take care of whitespace around punctuation: "tf ,idf" -> "tf, idf" and "weight ." -> "weight." It's a small detail, but small details are often quite important in programming, and if the documentation and/or comments look messy, people will probably pick that up subconsciously and assume the code is poor quality too.

@aarshkshah1992

aarshkshah1992 Mar 10, 2013

Yeah, sorry for the poor quality of the documentation.I think I treat it a bit too lightly,but understand it's importance now.Will take great care to write good quality documentation from now on and will pay attention to details.Thanks :)

xapian-core/include/xapian/weight.h
+ * 'S':Square tfn=tf*tf
+ * 'L':Logaritmic tfn=1+log(tf) where base of log is e by default
+ * but can be changed as per need .Mathematically,the base of the logarithm
+ * does not make any difference to the retrieval process.
@ojwb

ojwb Mar 7, 2013

Contributor

"Logaritmic" -> "Logarithmic".

Can the base of the log be changed in the current implementation?

@aarshkshah1992

aarshkshah1992 Mar 10, 2013

Yeah,it can be done easily. To change the base of the log to x ,all you need to do is divide log (1+tf) by log(x) in the code.Though from what I've read,the base of the log doesn't affect the retrieval in any way.

@ojwb

ojwb Mar 11, 2013

Contributor

Sorry, I wasn't clear. I wasn't suggesting it should be possible to change it, but what you've written here seems to suggest that it can be changed by the user "base of log is e by default but can be changed as per need". If the user has to modify the library source code, I wouldn't describe that as something which can be changed (pretty much everything can be changed if you allow for that).

Changing the base of the log just scales the weights by a constant factor, so it isn't worth the complication to the code and API, but the documentation really should match what the code does.

I think doxygen supports <sub>...</sub> for subscript text, so I'd suggest just saying (or if <sub> doesn't work here, ln(tf)):

'L': Logarithmic tfn=1+log<sub>e</sub>(tf)

And leave it at that.

xapian-core/include/xapian/weight.h
+ *
+ * The Max-Tf and Augmented Max Tf normalization can be implemented by
+ * rewriting the backend to obtain the maximum wdf among all terms of a
+ * document.
@ojwb

ojwb Mar 7, 2013

Contributor

It would be clearer to say "... aren't yet implemented".

xapian-core/include/xapian/weight.h
+ * The default string is "NTN".
+ */
+
+ TfIdfWeight(std::string normals)
@ojwb

ojwb Mar 7, 2013

Contributor

Pass std::string by const reference (it's more efficient):

TfIdfWeight(const std::string & normals)

@ojwb

ojwb Mar 7, 2013

Contributor

Also, constructors which take a single argument (or have default arguments such that they can take a single argument) should be marked explicit unless you really want to implicitly allow a std::string to be converted to a TfIdfWeight by the compiler. If you're not familiar with explicit, there's a good write up here:

http://stackoverflow.com/questions/121162/what-does-the-explicit-keyword-in-c-mean

xapian-core/include/xapian/weight.h
+ TfIdfWeight(std::string normals)
+ : normalizations(normals)
+ {
+ // If the normalization string is invalid,set it to the default
@ojwb

ojwb Mar 7, 2013

Contributor

Indentation...

xapian-core/include/xapian/weight.h
+ // If the normalization string is invalid,set it to the default
+ // normalizations
+ if (normalizations.length()!=3) {
+ normalizations="NTN";
@ojwb

ojwb Mar 7, 2013

Contributor

Indentation of "if". Please put a space around operators (!= and = here).

This only rejects a small subset of invalid strings (those that are the wrong length).

If you parsed the string here, you could reject any invalid combination by throwing Xapian::InvalidArgumentError, which would be more helpful to the API user.

@aarshkshah1992

aarshkshah1992 Mar 10, 2013

But I've implemented a default case for each normalization in the switch construct. If the user enters any 3 character string which is not valid, the default case for each character ('N', 'T', and 'N') is used. I checked the length here because if the length of the string is not 3,it is already invalid and we don't want to spend time checking each case in the switch construct.

@ojwb

ojwb Mar 11, 2013

Contributor

The result is rather inconsistent though - if you pass "LT" or "LTNX" you get "NTN", but if you pass "LOL" you get "LTN".

@aarshkshah1992

aarshkshah1992 Mar 11, 2013

Ah.I must admit I hadn't thought of that.Thanks Olly.I'll rewrite the code
and the tests so that an error is thrown whenever the string isn't valid
and remove the default case from all the functions.

On Mon, Mar 11, 2013 at 8:27 AM, Olly Betts notifications@github.comwrote:

In xapian-core/include/xapian/weight.h:

  • \*                         More normalizations for all components can be implemented by
    
  • \*                         changing the backend to acquire the statistics
    
  • \*                         required for the normalizations which are not
    
  • \*                         currently available from Xapian::Weight.
    
  • *
    
  • *
    
  • \*                         The default string is "NTN".
    
  • */
    
  • TfIdfWeight(std::string normals)
  • : normalizations(normals)
  • {
  •   // If the normalization string is invalid,set it to the default
    
  •    // normalizations
    
  •       if (normalizations.length()!=3) {
    
  •        normalizations="NTN";
    

The result is rather inconsistent though - if you pass "LT" or "LTNX" you
get "NTN", but if you pass "LOL" you get "LTN".


Reply to this email directly or view it on GitHubhttps://github.com/xapian/xapian/pull/6/files#r3311719
.

xapian-core/include/xapian/weight.h
+ need_stat(TERMFREQ);
+ need_stat(WDF);
+ need_stat(WDF_MAX);
+ need_stat(COLLECTION_SIZE);
@ojwb

ojwb Mar 7, 2013

Contributor

Some combinations don't need all these stats - e.g. "NNN" doesn't actually use termfreq.

xapian-core/include/xapian/weight.h
+
+ double get_tfn(Xapian::termcount tf, const char c) const;
+ double get_idfn(Xapian::doccount termfreq, const char c) const;
+ double get_wtn(double wt, const char c) const;
@ojwb

ojwb Mar 7, 2013

Contributor

These should be private methods.

xapian-core/include/xapian/weight.h
+ // Not used anywhere but may be useful in the future.
+ int check_tfn(const char c) const;
+ int check_idfn(const char c) const;
+ int check_wtn(const char c) const;
@ojwb

ojwb Mar 7, 2013

Contributor

If they aren't used, overall it's probably better to omit them.

@aarshkshah1992

aarshkshah1992 Mar 10, 2013

Oh,I added them in case anyone wanted to tinker with the characters in the future.I'll remove them captain.

xapian-core/tests/api_weight.cc
@@ -19,6 +19,7 @@
*/
#include <config.h>
+#include <cmath>
@ojwb

ojwb Mar 7, 2013

Contributor

<config.h> should go first, then the include for the header corresponding to the file (i.e. "api_weight.h" in this case), as that helps to ensure that "api_weight.h" is including all the headers it needs. If you put first, "api_weight.h" could be relying on that, and we wouldn't know. That's bad because it means future changes not involving "api_weight.h" could require changes to it.

@aarshkshah1992

aarshkshah1992 Mar 10, 2013

I didn't understand this So like, are you saying that I should include cmath at the beginning of the file ? or after including "api_weight.h" .?

@ojwb

ojwb Mar 11, 2013

Contributor

You should first have <config.h> (as you do), then the header corresponding to the source file (so "api_weight.h" as this is "api_weight.cc"). Any other headers required should come after these two. HACKING gives a full preferred order for header includes, though the full order is not currently used consistently everywhere in the code, as we haven't yet updated all the code which predates coming up with the order.

The main benefit of putting the header corresponding to the .cc file first in that file is that it means every such header is included first (ignoring config.h, which is a special case) somewhere in the source tree, and because of that we'll definitely notice if api_weight.h grows a dependency on <cmath> (or any other header), as the code won't compile until we include the required header from api_weight.h.

xapian-core/tests/api_weight.cc
@@ -92,6 +93,136 @@
return true;
}
+
+/* Tests for TfIdfWeight */
@ojwb

ojwb Mar 7, 2013

Contributor

We don't have "section" comments like this elsewhere, so I wouldn't start adding them.

xapian-core/tests/api_weight.cc
+/* Tests for TfIdfWeight */
+
+
+//Test for various cases of invalid normalization string
@ojwb

ojwb Mar 7, 2013

Contributor

For readability, please put a space between // and the text of the comment.

xapian-core/tests/api_weight.cc
+ TEST_EQUAL(mset.size(), 2);
+ // doc 2 should have higher weight than 4 as only tf(wdf) will dominate.
+ mset_expect_order(mset, 2, 4);
+ TEST_EQUAL(mset[0].get_weight(),(8*log(6/2)));
@ojwb

ojwb Mar 7, 2013

Contributor

Exact comparison of floating point values is unwise, as the order of calculation and issues like excess precision on the x87 FPU can mean you don't get exactly the same answer doing the same calculation in two places in the code. Use the TEST_EQUAL_DOUBLE macro.

xapian-core/tests/api_weight.cc
+
+ // Normalization string parameter should be set to "NTN" if length is not 3
+ Xapian::TfIdfWeight weight("I WILL BE CHANGED");
+ TEST_EQUAL(weight.serialise(),"NTN");
@ojwb

ojwb Mar 7, 2013

Contributor

This test code is assuming what the serialisation is - it would be better to test via the API instead:

    TEST_EQUAL(weight.serialise(), Xapian::TfIdfWeight("NTN").serialise());
xapian-core/weight/tfidfweight.cc
+TfIdfWeight::serialise() const
+{
+ string result = normalizations;
+ return result;
@ojwb

ojwb Mar 7, 2013

Contributor

You can just return normalisations;

+TfIdfWeight *
+TfIdfWeight::unserialise(const string & s) const
+{
+
@ojwb

ojwb Mar 7, 2013

Contributor

Kill these random blank lines at the start of functions.

xapian-core/weight/tfidfweight.cc
+ case 'N':
+ return tf;
+ case 'B':
+ if (tf==0) return 0;
@ojwb

ojwb Mar 7, 2013

Contributor

If we got here, then tf shouldn't be 0.

@aarshkshah1992

aarshkshah1992 Mar 10, 2013

But during our discussion on IRC, you had said that I would need to check for wdf=0 .

@ojwb

ojwb Mar 11, 2013

Contributor

But wdf != tf...

If the tf is zero, then the term doesn't exist in the database, so it shouldn't be possible for us to get here.

@aarshkshah1992

aarshkshah1992 Mar 11, 2013

I think Ive made a bad choice of a variable name here.Actually tf is the
wdf in this code .In boolean normalization,we return tfn as 0 if wdf (tf)
is 0 and tfn as 1 otherwise.
In tf-idf scheme ,tf is the within document frequency.

On Mon, Mar 11, 2013 at 8:40 AM, Olly Betts notifications@github.comwrote:

In xapian-core/weight/tfidfweight.cc:

+double
+TfIdfWeight::get_maxextra() const
+{

  • return 0;
    +}

+// Return normalized tf,idf and weight depending on the normalization string
+double
+TfIdfWeight:: get_tfn(Xapian::termcount tf, const char c) const
+{

  • switch (c) {
  •    case 'N':
    
  •        return tf;
    
  •    case 'B':
    
  •        if (tf==0) return 0;
    

But wdf != tf...

If the tf is zero, then the term doesn't exist in the database, so it
shouldn't be possible for us to get here.


Reply to this email directly or view it on GitHubhttps://github.com/xapian/xapian/pull/6/files#r3311772
.

@ojwb

ojwb Mar 11, 2013

Contributor

Oh, I see!

I would probably go with wdfn and wdf as variable names then, as "termfreq" and "tf" are widely used in Xapian to mean "number of documents which the term appears in".

@aarshkshah1992

aarshkshah1992 Mar 11, 2013

Yeah I do realize now that it is confusing.Will change it.Thanks.

On Tue, Mar 12, 2013 at 1:47 AM, Olly Betts notifications@github.comwrote:

In xapian-core/weight/tfidfweight.cc:

+double
+TfIdfWeight::get_maxextra() const
+{

  • return 0;
    +}

+// Return normalized tf,idf and weight depending on the normalization string
+double
+TfIdfWeight:: get_tfn(Xapian::termcount tf, const char c) const
+{

  • switch (c) {
  •    case 'N':
    
  •        return tf;
    
  •    case 'B':
    
  •        if (tf==0) return 0;
    

Oh, I see!

I would probably go with wdfn and wdf as variable names then, as
"termfreq" and "tf" are widely used in Xapian to mean "number of documents
which the term appears in".


Reply to this email directly or view it on GitHubhttps://github.com/xapian/xapian/pull/6/files#r3324537
.

xapian-core/weight/tfidfweight.cc
+ case 'N':
+ return 1.0;
+ case 'T':
+ if (N==0) return 0; //Database can be empty
@ojwb

ojwb Mar 7, 2013

Contributor

If the database is empty, then there are no posting lists, so we should never get here.

xapian-core/weight/tfidfweight.cc
+ char tfn_array[]={'N','B','S','L','\0'};
+ for (int i=0;tfn_array[i]!='\0';++i) {
+ if (tfn_array[i] == c) return 1;
+ }
@ojwb

ojwb Mar 7, 2013

Contributor

I'd suggest culling these functions, but while we're here - arrays which don't change should always be constant - that way they can go in the read-only data section which can be shared among programs using the libxapian shared library (rather than using different memory in each process).

But in fact you can write this whole function body as simply:

    return strchr("NBSL", c) != NULL;
xapian-core/weight/tfidfweight.cc
+
+// Check for validity of each character of string
+// Not used anywhere but maybe useful in the future
+int TfIdfWeight:: check_tfn(const char c) const
@ojwb

ojwb Mar 7, 2013

Contributor

Generally you wouldn't mark a parameter of simple type like "char" as const in the function prototype.

Sorry for hurting the existing BM25 code below, I know it's irritating. I'll take great care to avoid that in my patches now .

Contributor

ojwb commented Apr 8, 2013

There's a few loose ends to tie up still, but I don't see any point holding off so I've applied this to trunk now.

Thanks for all your work on this.

@ojwb ojwb closed this Apr 8, 2013

Hey,thanks a lot !! :)

On Mon, Apr 8, 2013 at 12:06 PM, Olly Betts notifications@github.comwrote:

There's a few loose ends to tie up still, but I don't see any point
holding off so I've applied this to trunk now.

Thanks for all your work on this.


Reply to this email directly or view it on GitHubhttps://github.com/xapian/xapian/pull/6#issuecomment-16034793
.

I went through the list of corrections you've made and understood all of
them except for the last one related to indentation.

On Mon, Apr 8, 2013 at 9:01 PM, aarsh shah aarshkshah1992@gmail.com wrote:

Hey,thanks a lot !! :)

On Mon, Apr 8, 2013 at 12:06 PM, Olly Betts notifications@github.comwrote:

There's a few loose ends to tie up still, but I don't see any point
holding off so I've applied this to trunk now.

Thanks for all your work on this.


Reply to this email directly or view it on GitHubhttps://github.com/xapian/xapian/pull/6#issuecomment-16034793
.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment