The TfIdf Weighting scheme #6

Closed
wants to merge 21 commits into from

2 participants

@aarshkshah1992

The code for tfidfweight.cc does not contain LOGCALL yet; I'll add it, along with all relevant documentation for this scheme, after working on the feedback. I have implemented all the normalizations possible with the current statistics. The code is written so that it is easy to add more normalizations once the backend has been rewritten to provide additional statistics. I have also tried to provide as extensive test coverage as I could. Please let me know if any changes, additions or deletions are required.

xapian-core/include/xapian/weight.h
@@ -347,6 +347,120 @@ class XAPIAN_VISIBILITY_DEFAULT BoolWeight : public Weight {
double get_maxextra() const;
};
+/* Xapian::Weight subclass implementing the tf-idf weighting scheme with various
+ normalizations for tf ,idf and the tfidf weight . */
@ojwb Owner
ojwb added a note

The comments before each class and method are specially formatted, so that they can be automatically extracted by doxygen, and turned into the API documentation you can see on the website, e.g. at http://xapian.org/docs/apidoc/html/classXapian_1_1BM25Weight.html

Doxygen needs a comment to start /** or /// or else it assumes it's just a normal comment, not a documentation comment.

The first line should be a short description - if there's much more to say than fits on a line, write a short summary, put a paragraph break, and then go into more detail. Look at existing classes and methods for examples.

Something like Xapian::Weight subclass implementing the tf-idf weighting schemes. would be a good short description here.

Also, take care of whitespace around punctuation: "tf ,idf" -> "tf, idf" and "weight ." -> "weight." It's a small detail, but small details are often quite important in programming, and if the documentation and/or comments look messy, people will probably pick that up subconsciously and assume the code is poor quality too.
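
As a minimal sketch of the comment layout described above: a one-line summary introduced by `///` or `/**`, then a blank comment line, then the detail. `ExampleWeight` and its members are hypothetical names used only for illustration; they are not part of Xapian.

```cpp
#include <cassert>
#include <string>

/// Toy class used only to illustrate doxygen comment layout.
class ExampleWeight {
    /// Three character string selecting the normalizations.
    std::string normalizations;

  public:
    /** Construct an ExampleWeight.
     *
     *  The short summary above is followed by a blank comment line, and
     *  only then by the longer description.
     *
     *  @param normalizations_  Three character normalization string.
     */
    explicit ExampleWeight(const std::string & normalizations_)
        : normalizations(normalizations_) { }

    /// Return the normalization string this object was built with.
    const std::string & get_normalizations() const { return normalizations; }
};
```

A plain `/* ... */` or `// ...` comment in the same position would compile identically but would not be picked up as documentation.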

Yeah, sorry for the poor quality of the documentation. I think I treat it a bit too lightly, but I understand its importance now. I will take great care to write good quality documentation from now on and will pay attention to details. Thanks :)

xapian-core/include/xapian/weight.h
((16 lines not shown))
+ /** Construct a TfIdfWeight
+ *
+ * @param normalizations A three character string indicating the normalizations
+ * to be used for the tf,idf and document weight
+ * respectively.
+ *
+ * The first character specifies the normalization
+ * for the tf for which the following normalizations
+ * are currently available:
+ *
+ * 'N':None. tfn=tf
+ * 'B':Boolean tfn=1 if t in document else tfn=0
+ * 'S':Square tfn=tf*tf
+ * 'L':Logaritmic tfn=1+log(tf) where base of log is e by default
+ * but can be changed as per need .Mathematically,the base of the logarithm
+ * does not make any difference to the retrieval process.
@ojwb Owner
ojwb added a note

"Logaritmic" -> "Logarithmic".

Can the base of the log be changed in the current implementation?

Yeah, it can be done easily. To change the base of the log to x, all you need to do is divide log(tf) by log(x) in the code. Though from what I've read, the base of the log doesn't affect the retrieval in any way.

@ojwb Owner
ojwb added a note

Sorry, I wasn't clear. I wasn't suggesting it should be possible to change it, but what you've written here seems to suggest that it can be changed by the user "base of log is e by default but can be changed as per need". If the user has to modify the library source code, I wouldn't describe that as something which can be changed (pretty much everything can be changed if you allow for that).

Changing the base of the log just scales the weights by a constant factor, so it isn't worth the complication to the code and API, but the documentation really should match what the code does.

I think doxygen supports <sub>...</sub> for subscript text, so I'd suggest just saying (or if <sub> doesn't work here, ln(tf)):

'L': Logarithmic tfn=1+log<sub>e</sub>(tf)

And leave it at that.
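
The "constant factor" point can be checked with a small standalone sketch. It relies only on the change-of-base identity log_x(v) = ln(v) / ln(x); `log_base` and `idf_t` are hypothetical helpers, and the `base` parameter does not exist in the patch, which always uses base e.

```cpp
#include <cassert>
#include <cmath>

// Logarithm of v in an arbitrary base, via the change-of-base identity.
double log_base(double v, double base) {
    return std::log(v) / std::log(base);
}

// idf for the 'T' normalization, log(N / termfreq), with a selectable base.
// The base parameter is an assumption made for this illustration only.
double idf_t(double collection_size, double termfreq, double base) {
    return log_base(collection_size / termfreq, base);
}
```

For any N and termfreq, the base-e value divided by the base-10 value is ln(10), so switching base rescales every idf by the same amount and leaves the ordering of documents unchanged.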

xapian-core/include/xapian/weight.h
((20 lines not shown))
+ * respectively.
+ *
+ * The first character specifies the normalization
+ * for the tf for which the following normalizations
+ * are currently available:
+ *
+ * 'N':None. tfn=tf
+ * 'B':Boolean tfn=1 if t in document else tfn=0
+ * 'S':Square tfn=tf*tf
+ * 'L':Logaritmic tfn=1+log(tf) where base of log is e by default
+ * but can be changed as per need .Mathematically,the base of the logarithm
+ * does not make any difference to the retrieval process.
+ *
+ * The Max-Tf and Augmented Max Tf normalization can be implemented by
+ * rewriting the backend to obtain the maximum wdf among all terms of a
+ * document.
@ojwb Owner
ojwb added a note

It would be clearer to say "... aren't yet implemented".

xapian-core/include/xapian/weight.h
((53 lines not shown))
+ * 'N':None wtn=tfn*idfn
+ * Implementing more normalizaions for the weight requires access to
+ * statistics such as the weight of all terms in the document indexed by
+ * the term in the query. This is not available from the current backend.
+ *
+ *
+ * More normalizations for all components can be implemented by
+ * changing the backend to acquire the statistics
+ * required for the normalizations which are not
+ * currently available from Xapian::Weight.
+ *
+ *
+ * The default string is "NTN".
+ */
+
+ TfIdfWeight(std::string normals)
@ojwb Owner
ojwb added a note

Pass std::string by const reference (it's more efficient):

TfIdfWeight(const std::string & normals)

@ojwb Owner
ojwb added a note

Also, constructors which take a single argument (or have default arguments such that they can take a single argument) should be marked explicit unless you really want to implicitly allow a std::string to be converted to a TfIdfWeight by the compiler. If you're not familiar with explicit, there's a good write up here:

http://stackoverflow.com/questions/121162/what-does-the-explicit-keyword-in-c-mean
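
A minimal sketch of the difference `explicit` makes, assuming C++11 or later for `static_assert` and `<type_traits>`. `ImplicitWeight`, `ExplicitWeight` and `length_of` are hypothetical stand-ins, not Xapian classes.

```cpp
#include <cassert>
#include <cstddef>
#include <string>
#include <type_traits>

// Single-argument constructor without explicit: a std::string silently
// converts to an ImplicitWeight wherever one is expected.
struct ImplicitWeight {
    std::string normalizations;
    ImplicitWeight(const std::string & n) : normalizations(n) { }
};

// The same constructor marked explicit: the conversion must be spelled out.
struct ExplicitWeight {
    std::string normalizations;
    explicit ExplicitWeight(const std::string & n) : normalizations(n) { }
};

// Taking the argument by const reference avoids copying at the call site.
std::size_t length_of(const ImplicitWeight & w) {
    return w.normalizations.size();
}

static_assert(std::is_convertible<std::string, ImplicitWeight>::value,
              "implicit conversion is allowed without explicit");
static_assert(!std::is_convertible<std::string, ExplicitWeight>::value,
              "explicit blocks the implicit conversion");
```

With the non-explicit version, `length_of(std::string("NTN"))` compiles even though the caller never mentioned a weight object, which is the kind of surprise `explicit` prevents.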

xapian-core/include/xapian/weight.h
((56 lines not shown))
+ * the term in the query. This is not available from the current backend.
+ *
+ *
+ * More normalizations for all components can be implemented by
+ * changing the backend to acquire the statistics
+ * required for the normalizations which are not
+ * currently available from Xapian::Weight.
+ *
+ *
+ * The default string is "NTN".
+ */
+
+ TfIdfWeight(std::string normals)
+ : normalizations(normals)
+ {
+ // If the normalization string is invalid,set it to the default
@ojwb Owner
ojwb added a note

Indentation...

xapian-core/include/xapian/weight.h
((59 lines not shown))
+ * More normalizations for all components can be implemented by
+ * changing the backend to acquire the statistics
+ * required for the normalizations which are not
+ * currently available from Xapian::Weight.
+ *
+ *
+ * The default string is "NTN".
+ */
+
+ TfIdfWeight(std::string normals)
+ : normalizations(normals)
+ {
+ // If the normalization string is invalid,set it to the default
+ // normalizations
+ if (normalizations.length()!=3) {
+ normalizations="NTN";
@ojwb Owner
ojwb added a note

Indentation of "if". Please put a space around operators (!= and = here).

This only rejects a small subset of invalid strings (those that are the wrong length).

If you parsed the string here, you could reject any invalid combination by throwing Xapian::InvalidArgumentError, which would be more helpful to the API user.

But I've implemented a default case for each normalization in the switch construct. If the user enters any 3 character string which is not valid, the default case for each character ('N', 'T', and 'N') is used. I checked the length here because if the length of the string is not 3, it is already invalid and we don't want to spend time checking each case in the switch construct.

@ojwb Owner
ojwb added a note

The result is rather inconsistent though - if you pass "LT" or "LTNX" you get "NTN", but if you pass "LOL" you get "LTN".
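
One way to make the behaviour consistent is to validate every position up front and reject anything unrecognised. The sketch below is standalone: it throws `std::invalid_argument` as a stand-in for `Xapian::InvalidArgumentError`, and `is_one_of`, `is_valid_normalizations` and `validate_normalizations` are hypothetical names. The allowed characters are the ones listed in the patch's documentation.

```cpp
#include <cassert>
#include <cstring>
#include <stdexcept>
#include <string>

// True if c is one of the characters in allowed. strchr() also matches the
// terminating NUL of allowed, so a NUL character is rejected explicitly.
static bool is_one_of(char c, const char * allowed) {
    return c != '\0' && std::strchr(allowed, c) != nullptr;
}

// Check length and each position: wdf, idf, then document weight.
bool is_valid_normalizations(const std::string & s) {
    return s.length() == 3 &&
           is_one_of(s[0], "NBSL") &&
           is_one_of(s[1], "NTP") &&
           is_one_of(s[2], "N");
}

// Throwing variant; std::invalid_argument stands in for the Xapian exception.
void validate_normalizations(const std::string & s) {
    if (!is_valid_normalizations(s))
        throw std::invalid_argument("Normalization string is invalid: " + s);
}
```

With this check "LT", "LTNX" and "LOL" are all rejected in the same way, instead of two of them silently becoming "NTN" and the third becoming "LTN".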

xapian-core/include/xapian/weight.h
((64 lines not shown))
+ *
+ * The default string is "NTN".
+ */
+
+ TfIdfWeight(std::string normals)
+ : normalizations(normals)
+ {
+ // If the normalization string is invalid,set it to the default
+ // normalizations
+ if (normalizations.length()!=3) {
+ normalizations="NTN";
+ }
+ need_stat(TERMFREQ);
+ need_stat(WDF);
+ need_stat(WDF_MAX);
+ need_stat(COLLECTION_SIZE);
@ojwb Owner
ojwb added a note

Some combinations don't need all these stats - e.g. "NNN" doesn't actually use termfreq.
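
A rough sketch of requesting only the statistics a given combination uses. The flag names match those passed to `need_stat()` in the patch, but the numeric values and the `stats_needed` function are assumptions made for this illustration; the real enum belongs to `Xapian::Weight`.

```cpp
#include <cassert>
#include <string>

// Hypothetical flag values, chosen only so they can be combined as bits.
enum StatFlag {
    TERMFREQ = 1,
    WDF = 2,
    WDF_MAX = 4,
    COLLECTION_SIZE = 8
};

// Return the set of statistics needed for an already-validated
// three character normalization string.
unsigned stats_needed(const std::string & normalizations) {
    // wdf is used for every document's weight, wdf_max for the upper bound.
    unsigned flags = WDF | WDF_MAX;
    // The idf component only needs termfreq and N when it is not 'N' (idfn=1).
    if (normalizations[1] != 'N') {
        flags |= TERMFREQ | COLLECTION_SIZE;
    }
    return flags;
}
```

Under these assumptions "NNN" asks for wdf and wdf_max only, while "NTN" and "NPN" additionally ask for termfreq and the collection size.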

xapian-core/include/xapian/weight.h
((93 lines not shown))
+ std::string serialise() const;
+ TfIdfWeight * unserialise(const std::string & s) const;
+
+ double get_sumpart(Xapian::termcount wdf,
+ Xapian::termcount doclen) const;
+ double get_maxpart() const;
+
+ double get_sumextra(Xapian::termcount doclen) const;
+ double get_maxextra() const;
+
+ /* When additional normalizations are implemented in the future,
+ the additional statistics for them should be accesed by these functions */
+
+ double get_tfn(Xapian::termcount tf, const char c) const;
+ double get_idfn(Xapian::doccount termfreq, const char c) const;
+ double get_wtn(double wt, const char c) const;
@ojwb Owner
ojwb added a note

These should be private methods.

xapian-core/include/xapian/weight.h
((99 lines not shown))
+
+ double get_sumextra(Xapian::termcount doclen) const;
+ double get_maxextra() const;
+
+ /* When additional normalizations are implemented in the future,
+ the additional statistics for them should be accesed by these functions */
+
+ double get_tfn(Xapian::termcount tf, const char c) const;
+ double get_idfn(Xapian::doccount termfreq, const char c) const;
+ double get_wtn(double wt, const char c) const;
+
+ // Functions to check for validity of each character of the normalization string .
+ // Not used anywhere but may be useful in the future.
+ int check_tfn(const char c) const;
+ int check_idfn(const char c) const;
+ int check_wtn(const char c) const;
@ojwb Owner
ojwb added a note

If they aren't used, overall it's probably better to omit them.

Oh, I added them in case anyone wanted to tinker with the characters in the future. I'll remove them, captain.

xapian-core/tests/api_weight.cc
@@ -19,6 +19,7 @@
*/
#include <config.h>
+#include <cmath>
@ojwb Owner
ojwb added a note

<config.h> should go first, then the include for the header corresponding to the file (i.e. "api_weight.h" in this case), as that helps to ensure that "api_weight.h" is including all the headers it needs. If you put <cmath> first, "api_weight.h" could be relying on that, and we wouldn't know. That's bad because it means future changes not involving "api_weight.h" could require changes to it.

I didn't understand this. Are you saying that I should include cmath at the beginning of the file, or after including "api_weight.h"?

@ojwb Owner
ojwb added a note

You should first have <config.h> (as you do), then the header corresponding to the source file (so "api_weight.h" as this is "api_weight.cc"). Any other headers required should come after these two. HACKING gives a full preferred order for header includes, though the full order is not currently used consistently everywhere in the code, as we haven't yet updated all the code which predates coming up with the order.

The main benefit of putting the header corresponding to the .cc file first in that file is that it means every such header is included first (ignoring config.h, which is a special case) somewhere in the source tree, and because of that we'll definitely notice if api_weight.h grows a dependency on <cmath> (or any other header), as the code won't compile until we include the required header from api_weight.h.

xapian-core/tests/api_weight.cc
@@ -92,6 +93,136 @@
return true;
}
+
+/* Tests for TfIdfWeight */
@ojwb Owner
ojwb added a note

We don't have "section" comments like this elsewhere, so I wouldn't start adding them.

xapian-core/tests/api_weight.cc
@@ -92,6 +93,136 @@
return true;
}
+
+/* Tests for TfIdfWeight */
+
+
+//Test for various cases of invalid normalization string
@ojwb Owner
ojwb added a note

For readability, please put a space between // and the text of the comment.

xapian-core/tests/api_weight.cc
((8 lines not shown))
+//Test for various cases of invalid normalization string
+DEFINE_TESTCASE(tfidfweight1, backend) {
+ Xapian::Database db = get_database("apitest_simpledata");
+ Xapian::Enquire enquire(db);
+ Xapian::MSet mset;
+
+ /* Default cases of normalization i.e 'N','T' and 'N' should be used by the
+ normalization functions if normalizationg string is of length 3 but is
+ invalid */
+ enquire.set_query(Xapian::Query("word"));
+ enquire.set_weighting_scheme(Xapian::TfIdfWeight("CYA"));
+ mset = enquire.get_mset(0, 10);
+ TEST_EQUAL(mset.size(), 2);
+ // doc 2 should have higher weight than 4 as only tf(wdf) will dominate.
+ mset_expect_order(mset, 2, 4);
+ TEST_EQUAL(mset[0].get_weight(),(8*log(6/2)));
@ojwb Owner
ojwb added a note

Exact comparison of floating point values is unwise, as the order of calculation and issues like excess precision on the x87 FPU can mean you don't get exactly the same answer doing the same calculation in two places in the code. Use the TEST_EQUAL_DOUBLE macro.
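
To illustrate the general idea of a tolerance-based comparison, here is a standalone sketch. `nearly_equal` is a hypothetical helper with an arbitrarily chosen default tolerance; it is not how Xapian's TEST_EQUAL_DOUBLE macro is implemented.

```cpp
#include <algorithm>
#include <cassert>
#include <cmath>

// Compare two doubles with a tolerance that is relative for large values
// and absolute (rel_tol itself) for values whose magnitude is below 1.
bool nearly_equal(double a, double b, double rel_tol = 1e-9) {
    double scale = std::max(std::fabs(a), std::fabs(b));
    return std::fabs(a - b) <= rel_tol * std::max(scale, 1.0);
}
```

A check such as `nearly_equal(weight, 8 * std::log(3.0))` then tolerates the last-bit differences that arise when the same quantity is computed along two different code paths.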

xapian-core/tests/api_weight.cc
((12 lines not shown))
+ Xapian::MSet mset;
+
+ /* Default cases of normalization i.e 'N','T' and 'N' should be used by the
+ normalization functions if normalizationg string is of length 3 but is
+ invalid */
+ enquire.set_query(Xapian::Query("word"));
+ enquire.set_weighting_scheme(Xapian::TfIdfWeight("CYA"));
+ mset = enquire.get_mset(0, 10);
+ TEST_EQUAL(mset.size(), 2);
+ // doc 2 should have higher weight than 4 as only tf(wdf) will dominate.
+ mset_expect_order(mset, 2, 4);
+ TEST_EQUAL(mset[0].get_weight(),(8*log(6/2)));
+
+ // Normalization string parameter should be set to "NTN" if length is not 3
+ Xapian::TfIdfWeight weight("I WILL BE CHANGED");
+ TEST_EQUAL(weight.serialise(),"NTN");
@ojwb Owner
ojwb added a note

This test code is assuming what the serialisation is - it would be better to test via the API instead:

    TEST_EQUAL(weight.serialise(), Xapian::TfIdfWeight("NTN").serialise());
xapian-core/weight/tfidfweight.cc
((43 lines not shown))
+TfIdfWeight::init(double)
+{
+ // None required
+}
+
+string
+TfIdfWeight::name() const
+{
+ return "Xapian::TfIdfWeight";
+}
+
+string
+TfIdfWeight::serialise() const
+{
+ string result = normalizations;
+ return result;
@ojwb Owner
ojwb added a note

You can just return normalizations;

@ojwb ojwb commented on the diff
xapian-core/weight/tfidfweight.cc
((49 lines not shown))
+TfIdfWeight::name() const
+{
+ return "Xapian::TfIdfWeight";
+}
+
+string
+TfIdfWeight::serialise() const
+{
+ string result = normalizations;
+ return result;
+}
+
+TfIdfWeight *
+TfIdfWeight::unserialise(const string & s) const
+{
+
@ojwb Owner
ojwb added a note

Kill these random blank lines at the start of functions.

xapian-core/weight/tfidfweight.cc
((93 lines not shown))
+
+double
+TfIdfWeight::get_maxextra() const
+{
+ return 0;
+}
+
+// Return normalized tf,idf and weight depending on the normalization string
+double
+TfIdfWeight:: get_tfn(Xapian::termcount tf, const char c) const
+{
+ switch (c) {
+ case 'N':
+ return tf;
+ case 'B':
+ if (tf==0) return 0;
@ojwb Owner
ojwb added a note

If we got here, then tf shouldn't be 0.

But during our discussion on IRC, you had said that I would need to check for wdf=0 .

@ojwb Owner
ojwb added a note

But wdf != tf...

If the tf is zero, then the term doesn't exist in the database, so it shouldn't be possible for us to get here.

@ojwb Owner
ojwb added a note

Oh, I see!

I would probably go with wdfn and wdf as variable names then, as "termfreq" and "tf" are widely used in Xapian to mean "number of documents which the term appears in".
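
The distinction between the two statistics can be shown on a toy in-memory corpus. This is a simplified sketch assuming C++11 or later: `Document`, `wdf` and `termfreq` are hypothetical free functions, and wdf is treated here as a plain occurrence count.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <string>
#include <vector>

// A document is modelled as the list of terms it contains.
typedef std::vector<std::string> Document;

// wdf: how many times the term occurs within one document.
std::size_t wdf(const Document & doc, const std::string & term) {
    return static_cast<std::size_t>(std::count(doc.begin(), doc.end(), term));
}

// termfreq: how many documents in the collection contain the term.
std::size_t termfreq(const std::vector<Document> & docs,
                     const std::string & term) {
    std::size_t n = 0;
    for (std::size_t i = 0; i < docs.size(); ++i) {
        if (wdf(docs[i], term) > 0) ++n;
    }
    return n;
}
```

A term that matched a document always has a non-zero termfreq, but what the 'B' and 'L' branches inspect is the per-document count, which is why the wdf naming is less confusing there.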

xapian-core/weight/tfidfweight.cc
((113 lines not shown))
+ if (tf==0) return 0;
+ else return (1+log(tf));
+ default:
+ return tf;
+ }
+}
+
+double
+TfIdfWeight::get_idfn(Xapian::doccount termfreq, const char c) const
+{
+ Xapian::doccount N=get_collection_size();
+ switch (c) {
+ case 'N':
+ return 1.0;
+ case 'T':
+ if (N==0) return 0; //Database can be empty
@ojwb Owner
ojwb added a note

If the database is empty, then there are no posting lists, so we should never get here.

xapian-core/weight/tfidfweight.cc
((145 lines not shown))
+ case 'N':
+ return wt;
+ default:
+ return wt;
+ }
+}
+
+// Check for validity of each character of string
+// Not used anywhere but maybe useful in the future
+int TfIdfWeight:: check_tfn(const char c) const
+{
+ // Add characters to this array when more normalizations are implemented
+ char tfn_array[]={'N','B','S','L','\0'};
+ for (int i=0;tfn_array[i]!='\0';++i) {
+ if (tfn_array[i] == c) return 1;
+ }
@ojwb Owner
ojwb added a note

I'd suggest culling these functions, but while we're here - arrays which don't change should always be constant - that way they can go in the read-only data section which can be shared among programs using the libxapian shared library (rather than using different memory in each process).

But in fact you can write this whole function body as simply:

    return strchr("NBSL", c) != NULL;
xapian-core/weight/tfidfweight.cc
((139 lines not shown))
+double
+TfIdfWeight:: get_wtn(double wt, const char c) const
+{
+/*Include future implementations of weight normalizations in the switch
+ construct*/
+ switch (c) {
+ case 'N':
+ return wt;
+ default:
+ return wt;
+ }
+}
+
+// Check for validity of each character of string
+// Not used anywhere but maybe useful in the future
+int TfIdfWeight:: check_tfn(const char c) const
@ojwb Owner
ojwb added a note

Generally you wouldn't mark a parameter of simple type like "char" as const in the function prototype.

@aarshkshah1992

Sorry for hurting the existing BM25 code below, I know it's irritating. I'll take great care to avoid that in my patches now.

@ojwb
Owner

There's a few loose ends to tie up still, but I don't see any point holding off so I've applied this to trunk now.

Thanks for all your work on this.

@ojwb ojwb closed this
@aarshkshah1992
@aarshkshah1992
Commits on Feb 27, 2013
  1. @aarshkshah1992
  2. @aarshkshah1992
Commits on Feb 28, 2013
  1. @aarshkshah1992
Commits on Mar 3, 2013
  1. @aarshkshah1992
  2. @aarshkshah1992

    Modified Unserialise to check for extra data and added fucntions for …

    aarshkshah1992 authored
    …checking validity of normalizations string .
  3. @aarshkshah1992
  4. @aarshkshah1992
  5. @aarshkshah1992
  6. @aarshkshah1992
Commits on Mar 4, 2013
  1. @aarshkshah1992
Commits on Mar 5, 2013
  1. @aarshkshah1992
Commits on Mar 20, 2013
  1. @aarshkshah1992
  2. @aarshkshah1992
  3. @aarshkshah1992
Commits on Mar 22, 2013
  1. @ojwb
  2. @ojwb
  3. @ojwb
Commits on Mar 24, 2013
  1. Added Xapian::TfIdfWeight to the Registry.

    aarsh kiran mansukhlal kalyanji nanchand shah authored
  2. Added TfIdfWeight to csharp makefile.

    aarsh kiran mansukhlal kalyanji nanchand shah authored
  3. Added TfIdfWeight to the java makefile.

    aarsh kiran mansukhlal kalyanji nanchand shah authored
  4. @aarshkshah1992

    Merge pull request #2 from ojwb/tfidf

    aarshkshah1992 authored
    Tfidf whitespace-related fixes
1  xapian-bindings/csharp/Makefile.am
@@ -58,6 +58,7 @@ XAPIAN_SWIG_CS_SRCS=\
generated-csharp/StringValueRangeProcessor.cs \
generated-csharp/TermGenerator.cs \
generated-csharp/TermIterator.cs \
+ generated-csharp/TfIdfWeight.cs \
generated-csharp/TradWeight.cs \
generated-csharp/ValueCountMatchSpy.cs \
generated-csharp/ValueIterator.cs \
1  xapian-bindings/java/Makefile.am
@@ -67,6 +67,7 @@ XAPIAN_SWIG_JAVA_SRCS=\
org/xapian/SWIGTYPE_p_std__string.java\
org/xapian/TermGenerator.java\
org/xapian/TermIterator.java\
+ org/xapian/TfIdfWeight.java\
org/xapian/TradWeight.java\
org/xapian/ValueCountMatchSpy.java\
org/xapian/ValueIterator.java\
12 xapian-core/ChangeLog
@@ -1,3 +1,15 @@
+Sun Mar 3 8:37 AM GMT 2013 Aarsh Shah <aarshkshah1992@gmail.com>
+
+ * weight/: Added tfidfweight.cc containing the implementation of the
+ TfIdfWeight class.
+
+ * include/xapian/weight.h: Added TfIdfWeight class for the tf-idf
+ weighting scheme.
+
+ * tests/api_weight.cc: Added tests for TfIdfWeight.
+
+ * tests/api_nodb.cc : Added simple tests for TfIdfWeight.
+
Tue Feb 19 04:17:19 GMT 2013 Olly Betts <olly@survex.com>
* common/Tokeniseise.pm: Add the ability to append lines to the
2  xapian-core/api/registry.cc
@@ -196,6 +196,8 @@ Registry::Internal::add_defaults()
wtschemes[weighting_scheme->name()] = weighting_scheme;
weighting_scheme = new Xapian::TradWeight;
wtschemes[weighting_scheme->name()] = weighting_scheme;
+ weighting_scheme = new Xapian::TfIdfWeight;
+ wtschemes[weighting_scheme->name()] = weighting_scheme;
Xapian::PostingSource * source;
source = new Xapian::ValueWeightPostingSource(0);
103 xapian-core/include/xapian/weight.h
@@ -23,9 +23,11 @@
#define XAPIAN_INCLUDED_WEIGHT_H
#include <string>
+#include <cstring>
#include <xapian/types.h>
#include <xapian/visibility.h>
+#include <xapian/error.h>
namespace Xapian {
@@ -347,6 +349,107 @@ class XAPIAN_VISIBILITY_DEFAULT BoolWeight : public Weight {
double get_maxextra() const;
};
+/// Xapian::Weight subclass implementing the tf-idf weighting scheme.
+class XAPIAN_VISIBILITY_DEFAULT TfIdfWeight : public Weight {
+ /* Three character string indicating the normalizations for tf(wdf), idf and
+ tfidf weight. */
+ std::string normalizations;
+
+ TfIdfWeight * clone() const;
+
+ void init(double factor);
+
+ /* When additional normalizations are implemented in the future, the additional statistics for them
+ should be accessed by these functions. */
+ double get_wdfn(Xapian::termcount wdf, char c) const;
+ double get_idfn(Xapian::doccount termfreq, char c) const;
+ double get_wtn(double wt, char c) const;
+
+ public:
+ /** Construct a TfIdfWeight
+ *
+ * @param normalizations A three character string indicating the normalizations
+ * to be used for the tf(wdf), idf and document weight
+ * respectively.
+ *
+ * The first character specifies the normalization
+ * for the wdf for which the following normalizations
+ * are currently available:
+ *
+ * 'N':None. wdfn=wdf
+ * 'B':Boolean wdfn=1 if term in document else wdfn=0
+ * 'S':Square wdfn=wdf*wdf
+ * 'L':Logarithmic wdfn=1+log<sub>e</sub>(wdf)
+ *
+ * The Max-wdf and Augmented Max wdf normalization aren't yet implemented.
+ *
+ *
+ * The second character indicates the normalization
+ * for the idf, the following of which are currently
+ * available:
+ *
+ * 'N':None idfn=1
+ * 'T':TfIdf idfn=log(N/Termfreq) where N is the number of documents in
+ * collection and Termfreq is the number of documents which are
+ * indexed by the term t.
+ * 'P':Prob idfn=log((N-Termfreq)/Termfreq)
+ *
+ *
+ * The third and the final character indicates the
+ * normalization for the document weight of which
+ * the following are currently available:
+ *
+ * 'N':None wtn=tfn*idfn
+ * Implementing more normalizations for the weight requires access to
+ * statistics such as the weight of all terms in the document indexed by
+ * the term in the query. This is not available from the current backend.
+ *
+ *
+ * More normalizations for all components can be implemented by
+ * changing the backend to acquire the statistics
+ * required for the normalizations which are not
+ * currently available from Xapian::Weight.
+ *
+ *
+ * The default string is "NTN".
+ */
+
+ explicit TfIdfWeight(const std::string &normals)
+ : normalizations(normals)
+ {
+ if (normalizations.length() != 3 || (! strchr("NBSL", normalizations[0])) || (! strchr("NTP", normalizations[1])) || (! strchr("N", normalizations[2])))
+ throw Xapian::InvalidArgumentError("Normalization string is invalid");
+ if (normalizations[1] != 'N') {
+ need_stat(TERMFREQ);
+ need_stat(COLLECTION_SIZE);
+ }
+ need_stat(WDF);
+ need_stat(WDF_MAX);
+ }
+
+ TfIdfWeight()
+ : normalizations("NTN")
+ {
+ need_stat(TERMFREQ);
+ need_stat(WDF);
+ need_stat(WDF_MAX);
+ need_stat(COLLECTION_SIZE);
+ }
+
+ std::string name() const;
+
+ std::string serialise() const;
+ TfIdfWeight * unserialise(const std::string & s) const;
+
+ double get_sumpart(Xapian::termcount wdf,
+ Xapian::termcount doclen) const;
+ double get_maxpart() const;
+
+ double get_sumextra(Xapian::termcount doclen) const;
+ double get_maxextra() const;
+};
+
+
/// Xapian::Weight subclass implementing the BM25 probabilistic formula.
class XAPIAN_VISIBILITY_DEFAULT BM25Weight : public Weight {
/// Factor to multiply the document length by.
11 xapian-core/tests/api_nodb.cc
@@ -302,6 +302,17 @@ DEFINE_TESTCASE(weight1, !backend) {
Xapian::BM25Weight bm25weight2(1, 0.5, 1, 0.5, 0.5);
TEST_NOT_EQUAL(bm25weight.serialise(), bm25weight2.serialise());
+ Xapian::TfIdfWeight tfidfweight_dflt;
+ Xapian::TfIdfWeight tfidfweight("NTN");
+ TEST_EQUAL(tfidfweight.name(), "Xapian::TfIdfWeight");
+ TEST_EQUAL(tfidfweight_dflt.serialise(), tfidfweight.serialise());
+ wt = Xapian::TfIdfWeight().unserialise(tfidfweight.serialise());
+ TEST_EQUAL(tfidfweight.serialise(), wt->serialise());
+ delete wt;
+
+ Xapian::TfIdfWeight tfidfweight2("BPN");
+ TEST_NOT_EQUAL(tfidfweight.serialise(), tfidfweight2.serialise());
+
return true;
}
116 xapian-core/tests/api_weight.cc
@@ -21,6 +21,7 @@
#include <config.h>
#include "api_weight.h"
+#include <cmath>
#include <xapian.h>
@@ -92,6 +93,121 @@ DEFINE_TESTCASE(bm25weight4, backend) {
return true;
}
+// Test for various cases of normalization string.
+DEFINE_TESTCASE(tfidfweight1, !backend) {
+ // InvalidArgumentError should be thrown if normalization string is invalid
+ try {
+ Xapian::TfIdfWeight b("JOHN_LENNON");
+ FAIL_TEST("Xapian::InvalidArgumentError not thrown for invalid normalization string");
+ } catch (const Xapian::InvalidArgumentError &x) {
+ // Good!
+ }
+
+ try {
+ Xapian::TfIdfWeight c("LOL");
+ FAIL_TEST("Xapian::InvalidArgumentError not thrown for invalid normalization string");
+ } catch (const Xapian::InvalidArgumentError &x) {
+ // Good!
+ }
+
+ /* Normalization string should be set to "NTN" by constructor if none is
+ given. */
+ Xapian::TfIdfWeight weight2;
+ TEST_EQUAL(weight2.serialise(), Xapian::TfIdfWeight("NTN").serialise());
+
+ return true;
+}
+
+// Test exception for junk after serialised weight.
+DEFINE_TESTCASE(tfidfweight2, !backend) {
+ Xapian::TfIdfWeight wt("NTN");
+ try {
+ Xapian::TfIdfWeight b;
+ Xapian::TfIdfWeight * b2 = b.unserialise(wt.serialise() + "X");
+ // Make sure we actually use the weight.
+ bool empty = b2->name().empty();
+ delete b2;
+ if (empty)
+ FAIL_TEST("Serialised TfIdfWeight with junk appended unserialised to empty name!");
+ FAIL_TEST("Serialised TfIdfWeight with junk appended unserialised OK");
+ } catch (const Xapian::SerialisationError &) {
+
+ }
+ return true;
+}
+
+// Feature tests for various normalization functions.
+DEFINE_TESTCASE(tfidfweight3, backend) {
+ Xapian::Database db = get_database("apitest_simpledata");
+ Xapian::Enquire enquire(db);
+ Xapian::MSet mset;
+
+ // Check for "NTN" when termfreq != N
+ enquire.set_query(Xapian::Query("word"));
+ enquire.set_weighting_scheme(Xapian::TfIdfWeight("NTN"));
+ mset = enquire.get_mset(0, 10);
+ TEST_EQUAL(mset.size(), 2);
+ // doc 2 should have higher weight than 4 as only tf(wdf) will dominate.
+ mset_expect_order(mset, 2, 4);
+ TEST_EQUAL_DOUBLE(mset[0].get_weight(), (8*log(6/2)));
+
+ // Check for "BNN" and for both branches of 'B'.
+ enquire.set_query(Xapian::Query("test"));
+ enquire.set_weighting_scheme(Xapian::TfIdfWeight("BNN"));
+ mset = enquire.get_mset(0, 10);
+ TEST_EQUAL(mset.size(), 1);
+ mset_expect_order(mset, 1);
+ TEST_EQUAL_DOUBLE(mset[0].get_weight(), 1.0);
+
+ // Check for "LNN" and for both branches of 'L'.
+ enquire.set_query(Xapian::Query("word"));
+ enquire.set_weighting_scheme(Xapian::TfIdfWeight("LNN"));
+ mset = enquire.get_mset(0, 10);
+ TEST_EQUAL(mset.size(), 2);
+ mset_expect_order(mset, 2, 4);
+ TEST_EQUAL_DOUBLE(mset[0].get_weight(), (1+log(8))); // idfn=1 and so wt=tfn=1+log(tf)
+ TEST_EQUAL_DOUBLE(mset[1].get_weight(), 1.0); // idfn=1 and wt=tfn=1+log(tf)=1+log(1)=1
+
+ // Check for "SNN"
+ enquire.set_query(Xapian::Query("paragraph"));
+ enquire.set_weighting_scheme(Xapian::TfIdfWeight("SNN")); // idf=1 and tfn=tf*tf
+ mset = enquire.get_mset(0, 10);
+ TEST_EQUAL(mset.size(), 5);
+ mset_expect_order(mset,2,1,4,3,5);
+ TEST_EQUAL_DOUBLE(mset[0].get_weight(), 9.0);
+ TEST_EQUAL_DOUBLE(mset[4].get_weight(), 1.0);
+
+ // Check for "NTN" when termfreq=N
+ enquire.set_query(Xapian::Query("this")); // N=termfreq and so idfn=0 for "T"
+ enquire.set_weighting_scheme(Xapian::TfIdfWeight("NTN"));
+ mset = enquire.get_mset(0, 10);
+ TEST_EQUAL(mset.size(), 6);
+ mset_expect_order(mset,1,2,3,4,5,6);
+ for (int i=0; i<6;++i) {
+ TEST_EQUAL_DOUBLE(mset[i].get_weight(), 0.0);
+ }
+
+ // Check for "NPN" and for both branches of 'P'
+ enquire.set_query(Xapian::Query("this")); // N=termfreq and so idfn=0 for "P"
+ enquire.set_weighting_scheme(Xapian::TfIdfWeight("NPN"));
+ mset = enquire.get_mset(0, 10);
+ TEST_EQUAL(mset.size(), 6);
+ mset_expect_order(mset,1,2,3,4,5,6);
+ for (int i=0; i<6;++i) {
+ TEST_EQUAL_DOUBLE(mset[i].get_weight(), 0.0);
+ }
+
+ enquire.set_query(Xapian::Query("word"));
+ enquire.set_weighting_scheme(Xapian::TfIdfWeight("NPN"));
+ mset = enquire.get_mset(0, 10);
+ TEST_EQUAL(mset.size(), 2);
+ mset_expect_order(mset,2,4);
+ TEST_EQUAL_DOUBLE(mset[0].get_weight(), 8*log((6-2)/2));
+ TEST_EQUAL_DOUBLE(mset[1].get_weight(), 1*log((6-2)/2));
+
+ return true;
+}
+
class CheckInitWeight : public Xapian::Weight {
public:
double factor;
1  xapian-core/weight/Makefile.mk
@@ -9,5 +9,6 @@ lib_src +=\
weight/bm25weight.cc\
weight/boolweight.cc\
weight/tradweight.cc\
+ weight/tfidfweight.cc\
weight/weight.cc\
weight/weightinternal.cc
149 xapian-core/weight/tfidfweight.cc
@@ -0,0 +1,149 @@
+/** @file tfidfweight.cc
+ * @brief Xapian::TfIdfWeight class - The TfIdf weighting scheme
+ */
+/* Copyright (C) 2013 Aarsh Shah
+ *
+ * This program is free software; you can redistribute it and/or
+ * modify it under the terms of the GNU General Public License as
+ * published by the Free Software Foundation; either version 2 of the
+ * License, or (at your option) any later version.
+ *
+ * This program is distributed in the hope that it will be useful
+ * but WITHOUT ANY WARRANTY; without even the implied warranty of
+ * MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the
+ * GNU General Public License for more details.
+ *
+ * You should have received a copy of the GNU General Public License
+ * along with this program; if not, write to the Free Software
+ * Foundation, Inc., 51 Franklin St, Fifth Floor, Boston, MA 02110-1301 USA
+ */
+
+#include <config.h>
+
+#include "xapian/weight.h"
+#include <cmath>
+
+#include "debuglog.h"
+#include "omassert.h"
+
+#include "xapian/error.h"
+
+using namespace std;
+
+namespace Xapian {
+
+TfIdfWeight *
+TfIdfWeight::clone() const
+{
+ return new TfIdfWeight(normalizations);
+}
+
+void
+TfIdfWeight::init(double)
+{
+ // None required
+}
+
+string
+TfIdfWeight::name() const
+{
+ return "Xapian::TfIdfWeight";
+}
+
+string
+TfIdfWeight::serialise() const
+{
+ return normalizations;
+}
+
+TfIdfWeight *
+TfIdfWeight::unserialise(const string & s) const
+{
+ if (s.length() != 3)
+ throw Xapian::SerialisationError("Extra data in TfIdfWeight::unserialise()");
+ else return new TfIdfWeight(s);
+}
+
+double
+TfIdfWeight::get_sumpart(Xapian::termcount wdf, Xapian::termcount) const
+{
+ Xapian::doccount termfreq = 1;
+ if (normalizations[1] != 'N') termfreq = get_termfreq();
+ return (get_wtn(get_wdfn(wdf, normalizations[0]) * get_idfn(termfreq, normalizations[1]), normalizations[2]));
+}
+
+// An upper bound can be calculated simply on the basis of wdf_max as termfreq and N are constants.
+double
+TfIdfWeight::get_maxpart() const
+{
+ Xapian::doccount termfreq = 1;
+ if (normalizations[1] != 'N') termfreq = get_termfreq();
+ Xapian::termcount wdf_max = get_wdf_upper_bound();
+ return (get_wtn(get_wdfn(wdf_max, normalizations[0]) * get_idfn(termfreq, normalizations[1]), normalizations[2]));
+}
+
+// There is no extra per document component in the TfIdfWeighting scheme.
+double
+TfIdfWeight::get_sumextra(Xapian::termcount) const
+{
+ return 0;
+}
+
+double
+TfIdfWeight::get_maxextra() const
+{
+ return 0;
+}
+
+// Return normalized wdf, idf and weight depending on the normalization string.
+double
+TfIdfWeight::get_wdfn(Xapian::termcount wdf, char c) const
+{
+ switch (c) {
+ case 'N':
+ return wdf;
+ case 'B':
+ if (wdf == 0) return 0;
+ else return 1.0;
+ case 'S':
+ return (wdf * wdf);
+ case 'L':
+ if (wdf == 0) return 0;
+ else return (1 + log(wdf));
+ default:
+ return wdf;
+ }
+}
+
+double
+TfIdfWeight::get_idfn(Xapian::doccount termfreq, char c) const
+{
+ double N = 1.0;
+ if (c != 'N') N = get_collection_size();
+ switch (c) {
+ case 'N':
+ return 1.0;
+ case 'T':
+ return (log(N / termfreq));
+ case 'P':
+ if (N == termfreq) return 0; // All documents are indexed by the term
+ else return log((N - termfreq) / termfreq);
+ default:
+ return (log(N / termfreq));
+ }
+}
+
+double
+TfIdfWeight::get_wtn(double wt, char c) const
+{
+/* Include future implementations of weight normalizations in the switch
+ construct */
+ switch (c) {
+ case 'N':
+ return wt;
+ default:
+ return wt;
+ }
+}
+
+}
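
As a worked check of the formulas against the expected values in the tfidfweight3 test, assume (as those expected values imply) that for the query "word" the collection has N = 6 documents and termfreq = 2, with wdf = 8 in document 2 and wdf = 1 in document 4. The decimal values are rounded.

```latex
\begin{align*}
\text{NTN, doc 2:}\quad & w = \mathit{wdf}\cdot\ln\frac{N}{\mathit{termfreq}}
    = 8\ln\frac{6}{2} = 8\ln 3 \approx 8.789 \\
\text{LNN, doc 2:}\quad & w = 1 + \ln(\mathit{wdf}) = 1 + \ln 8 \approx 3.079 \\
\text{LNN, doc 4:}\quad & w = 1 + \ln 1 = 1 \\
\text{NPN, doc 2:}\quad & w = \mathit{wdf}\cdot\ln\frac{N - \mathit{termfreq}}{\mathit{termfreq}}
    = 8\ln\frac{6-2}{2} = 8\ln 2 \approx 5.545 \\
\text{NPN, doc 4:}\quad & w = 1\cdot\ln 2 \approx 0.693
\end{align*}
```

For the query "this", termfreq equals N, so the 'T' normalization gives ln(6/6) = 0 and the 'P' normalization returns 0 through its special case. Every document therefore gets weight 0, as the test expects.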