Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Add support for BM25+ weighting function #104

Closed
wants to merge 24 commits into from
Closed

Conversation

ivmarkp
Copy link
Contributor

@ivmarkp ivmarkp commented Jun 3, 2016

  • The original bm25weight.cc file that contains implementation of BM25 weighting function is modified to add support for BM25+ weighting function without disturbing existing functionalities of BM25 function.
  • Also, modified existing test cases of BM25 to extend test coverage for BM25+ function.
  • Added API support in weight.h for the same.

Issue marked as "FIXME" about dealing with negative weights in bm25weight.cc that Olly mentioned on IRC is fixed by default when BM25+ is used. BM25+ can never assign a negative termweight as in BM25+ termweight is calculated as N+1 / df(t) which translates to (total doc in collection + 1) / termfreq in Xapian terminology. Since, term frequency can't possibly be greater than N so log((total doc in collection + 1) / termfreq) can never be negative. I'd say it can be regarded as one of the benefits of BM25+ function.

Please note that I haven't run any automated tests yet using make check. Just wanted to get the code reviewed first. Will make sure to run all the tests after getting feedback.

BM25Weight::BM25Weight(bool& bm25+)
{
if(bm25+ != 0 && bm25+ != 1)
throw Xapian::InvalidArgumentError("Boolean input is invalid.");
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This looks like it isn't aligned properly; it's the body of the if so should be aligned further across than the if itself. Also, you're indenting to 8 spaces (as a TAB) rather than 4 as the rest of the file.

@jaylett
Copy link
Contributor

jaylett commented Jun 3, 2016

@ivmarkp when you work on the changes I've suggested, it's also a good idea to run the final diff through our xapian-check-diff script, which will look for common formatting errors (mostly around spacing). The easiest way to do that is:

$ git diff github/master HEAD | ./xapian-maintainer-tools/xapian-check-patch

Assuming your remote to Xapian on github is called github and you're sitting at the top of the repository.

@ojwb
Copy link
Contributor

ojwb commented Jun 6, 2016

Please note that I haven't run any automated tests yet using make check. Just wanted to get the code reviewed first. Will make sure to run all the tests after getting feedback.

There's something to be said for early review, but your code here doesn't even compile (e.g. bm25+ isn't a valid parameter name). It's inefficient for you to open a PR and wait for us to review only to tell you about errors which the compiler could have found in seconds (even with lucky timing, you're unlikely to get a review from us in under 15 minutes, and it's more likely to be at least half a day).

For similar reasons, at least run the exist testcases before opening a PR - if your changes make existing tests fail, there's something wrong. It can happen that the issue is an existing testcase legitimately needs adjusting (it's not the normal situation, but it happens), but even then we can't merge a patch which makes existing tests fail - in this situation, those testcases need adjusting.

Running xapian-check-patch before submitting and fixing any issues it highlights is similarly a good idea - it's much faster feedback than waiting for a human to point out those same issues (which is why it exists).

@ivmarkp
Copy link
Contributor Author

ivmarkp commented Jun 7, 2016

@ojwb @jaylett Sorry, like you said I should have had compiled and run all the tests beforehand.
Although, I refactored the code following the previous comments and suggestions (thanks so much). Code is finally compiling but tests are failing with errors messages like SIGFPE at 0x80435b676 which are currently beyond my understanding as I have never encountered such errors before. Is there anything that can be done to find what exactly in the code causing the tests to fail? I'd commit the changes so you can look at the code if that's okay.

@gauravaror
Copy link
Member

gauravaror commented Jun 8, 2016

@ivmarkp SIGFPE is the easiest runtime error to debug - it is a floating point error. It is virtually always caused by a division by 0, so check any divisions or modulo operations in your code carefully.

It's always advisable to google the error you get, i generally tells you a lot about error and reason.

To solve this error, try checking test case which is failing for division by 0.

To debug any runtime error i will suggest using gdb.

Run your tests under gdb and see where are you getting the exception.

  1. Run your program under gdb.
  2. Add a breakpoint to catch exception using(catch throw)
  3. see the exact point when compilar stops at exception (bt)
  4. You can change stack level to further debug what causes the error.

If you have not worked with gdb, try reading a tutorial for gdb(it might be very helpful) http://www.tutorialspoint.com/gnu_debugger/

@ivmarkp
Copy link
Contributor Author

ivmarkp commented Jun 8, 2016

These tests errors have given me a couple of sleepless nights. I'm kind of stuck at this point. Thanks for the helpful links for gdb though. Hopefully, I'll locate the bugs and remove them and be able to move further according the project timeline.

@gauravaror
Copy link
Member

@ivmarkp I am happy to help/assist you in debugging these errors, If you have done small amount of research into error your self on IRC(It's easy ). I am generally around my laptop from 11AM IST to 6-7PM IST, hence higher probability of replying .

If you want we can create a account on quassel core on VM, which can help you stay connected on IRC and see all messages when you are back online

@ivmarkp
Copy link
Contributor Author

ivmarkp commented Jun 8, 2016

@samuelharden I've read about it as you advised. I found this link that has information on various program errors including SIGFPE. I'm currently verifying test cases manually for division by zero error in the code.

Yes, it'd be great to have an account on quassel! I have been wanting to setup my IRC account on quassel from past few months but couldn't due to unavailability of server (or 24x7 running PC) at home.

@ivmarkp
Copy link
Contributor Author

ivmarkp commented Jun 10, 2016

@ojwb @jaylett I have implemented a new sub class Xapian::BM25PlusWeight for BM25+ weighting function. It passed all the tests including the new ones. Please review the changes and let me know if any changes are required. Thanks!

P.S. Little update on what I have been doing -- Last few days, I was busy trying to modify the bm25weight.cc to incorporate functionalities for BM25+ weighting function but that didn't turn out very well as make check tests were failing. I tried verifying the tests manually. A couple of days were spent trying to refactor and debug the same code but no success. Finally, I decided to implement a new subclass for BM25+ function which seems to have worked well.

*/
class XAPIAN_VISIBILITY_DEFAULT BM25PlusWeight : public BM25Weight {
/// Factor to multiply the document length by.
mutable Xapian::doclength len_factor;
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Do you actually need to repeat these members, given you're inheriting from BM25Weight?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

These data members are used for various calculations in member functions of BM25PlusWeight subclass and they ,being private members by default, can't be accessed from BM25Weight. So, I repeated them here.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

So we should probably promote them to protected, otherwise each BM25PlusWeight object has lots of members it's not using.

@jaylett
Copy link
Contributor

jaylett commented Jun 12, 2016

I'm not at home, so I can't easily build and play with this, but you don't seem to have rolled back your previous changes to the BM25Weight implementation, so this may still not compile.

You also seem (beyond serialisation checks) to only have one test for the BM25+ implementation, which only uses one configuration. I'm not sure if this is enough without having the facilities to dig a bit deeper, but it may be worth thinking about whether there are corner cases you aren't currently testing.

@ivmarkp
Copy link
Contributor Author

ivmarkp commented Jun 13, 2016

@jaylett Thanks for the review. I'll write more test cases to improve overall test coverage. And I made last commits from Github interface after making changes to a fresh clone locally and building it. I forgot to rollback the previous changes I made in bm25weight.cc here. I'm sorry about that. Working on it now.

@jaylett
Copy link
Contributor

jaylett commented Jun 15, 2016

@ivmarkp you have a merge commit in this; generally you shouldn't submit PRs with merge commits at all; this one in particular is weird because the message suggests you've merged from the bm25+ plus into itself, when the merge I assume you intended was from upstream master into bm25+. You may want to review the article I wrote on remotes and rebasing, which I posted to the list recently.

Additionally, in the merge commit you seem to have committed at least one conflicted file without resolving it. (You may get the same issue when rebasing to make a merge-free branch, so you should be careful about that.)

@ivmarkp
Copy link
Contributor Author

ivmarkp commented Jun 15, 2016

@jaylett Sorry, this seems all messed. Everything looks just fine in my local repo but now all the changes have extra indentation and spaces here. I had checked the patches for any indentation issues using git diff origin/bm25+ | xapian-maintainer-tools/xapian-check-patch and no issues were reported. I have set tab character to 8 spaces as instructed in HACKING file. What has gone wrong here? :-/

Actually, the local clone was not up-to-date with remote repo (github.com/ivmarkp/xapian) so I pulled the changes and merged them. Then, made the new changes and pushed them. All of the past commits have also come up in this PR. This has become confusing. Please help me fix this.

@@ -39,7 +40,7 @@ def my_open_write(dest):
if hasattr(dest, "write"):
return dest
else:
return open(dest, 'w')
return io.open(dest, 'w', encoding = 'utf8')


class Doxy2SWIG:
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is a useful change, but not really related to the stated purpose of the PR ("Add support for BM25+ weighting function") - it just happens this change adds the first instance of a non-ASCII character to the C++ documentation comments (or at least to those which are taken through to Python). So it would arguably be better to deal with separately.

It's also incomplete - the same issue affects Python3.

And for Python2, even with the fix in this commit I get an error when the Python interpreter tries to import the wrapper due to the default encoding for Python2 being ASCII (since 2.5 - before that it was latin-1 which wouldn't give an error, but wouldn't work as intended):

/usr/bin/python2 -c 'import xapian'
Traceback (most recent call last):
  File "<string>", line 1, in <module>
  File "xapian/__init__.py", line 3099
SyntaxError: Non-ASCII character '\xc3' in file xapian/__init__.py on line 3100, but no encoding declared; see http://python.org/dev/peps/pep-0263/ for details

Python3's default source encoding if UTF-8, so no equivalent change is needed.

So I've cherry-picked this commit as 215c8cc, with the identical fix applied for Python3, and then committed c277f44 to specify the source ending for Python2.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks. Is it okay to leave this change in this PR or remove it now?

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Probably the final step before merging will be to rebase to squash the "fixup" commits onto the commits they fixup - I'd suggest we rebase onto git master as part of that, which should automatically drop this commit as already applied (and similarly should drop the trailing whitespace change that's in another commit).

@@ -110,7 +109,7 @@ BM25PlusWeight::get_sumpart(Xapian::termcount wdf, Xapian::termcount len,
// Parameter delta (δ) is a pseudo tf value to control the scale of the
// tf lower bound. δ can be tuned for e.g from 0.0 to 1.5 but BM25+ can
// still work effectively across collections with a fixed δ = 1.0
RETURN((termweight * (wdf_double / denom)) + param_delta);
RETURN((termweight * ((((param_k1 + 1) * wdf_double) / denom) + param_delta)));
}
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

There's now a (( and )) around the whole return expression, which just makes it harder to read the expression - this means exactly the same:

RETURN(termweight * ((((param_k1 + 1) * wdf_double) / denom) + param_delta));

But also (x * y) / z and x * y / z mean the same since * and / have the same precedence (http://en.cppreference.com/w/cpp/language/operator_precedence) and + has lower precedence, so you could actually lose another two pair of parentheses:

RETURN(termweight * ((param_k1 + 1) * wdf_double / denom + param_delta));

This step's more arguable - some people prefer explicit parentheses, especially when there are operators with different precedence involved. But let's definitely lose the extra outer pair (and similarly below).

Also in a case like this where the initial implementation has a bug, it's a really good idea to add a regression testcase which will fail if the fix ever got undone. Bugs have a tendency to recur - if the code got misimplemented in a particular way, it's natural that someone reworking it may make the same mistake again, and it's also good not to repeat the same mistake as it looks bad to users, management, etc - and adding regression testcases is a big help in preventing that.

ojwb pushed a commit that referenced this pull request Aug 16, 2016
@ojwb
Copy link
Contributor

ojwb commented Aug 16, 2016

Thanks for the quick response - all looks good to me now.

@jaylett suggested on IRC that it was probably best to just squash this into one commit when merging to master, so I've done that and pushed it.

@ojwb ojwb closed this Aug 16, 2016
richhiey pushed a commit to richhiey/xapian that referenced this pull request Aug 16, 2016
ojwb pushed a commit that referenced this pull request Sep 18, 2016
Squashed down from changes in #104

(cherry picked from commit b349eb6)
jsonn pushed a commit to jsonn/pkgsrc that referenced this pull request Nov 7, 2016
API:

* Constructing a Query for a non-reference counted PostingSource object will
  now try to clone the PostingSource object (as happened in 1.3.4 and
  earlier).  This clone code was removed as part of the changes in 1.3.5 to
  support optional reference counting of PostingSource objects, but that breaks
  the case when the PostingSource object is on the stack and goes out of scope
  before the Query object is used.  Issue reported by Till Schäfer and analysed
  by Daniel Vrátil in a bug report against Akonadi:
  https://bugs.kde.org/show_bug.cgi?id=363741

* Add BM25PlusWeight class implementing the BM25+ weighting scheme, implemented
  by Vivek Pal (xapian/xapian#104).

* Add PL2PlusWeight class implementing the PL2+ weighting scheme, implemented
  by Vivek Pal (xapian/xapian#108).

* LMWeight: Implement Dir+ weighting scheme as DIRICHLET_PLUS_SMOOTHING.
  Patch from Vivek Pal.

* Add CoordWeight class implementing coordinate matching.  This can be useful
  for specialised uses - e.g. to implement sorting by the number of matching
  filters.

* DLHWeight,DPHWeight,PL2Weight: With these weighting schemes, the formulae
  can give a negative weight contribution for a term in extreme cases.  We
  used to try to handle this by calculating a per-term lower bound on the
  contribution and subtracting this from the contribution, but this idea
  is fundamentally flawed as the total offset it adds to a document depends on
  what combination of terms that document matches, meaning in general the
  offset isn't the same for every matching document.  So instead we now clamp
  each term's weight contribution to be >= 0.

* TfIdfWeight: Always scale term weight by wqf - this seems the logical
  approach as it matches the weighting we'd get if we weighted every non-unique
  term in the query, as well as being explicit in the Piv+ formula.

* Fix OP_SCALE_WEIGHT to work with all weighting schemes - previously it was
  ignored when using PL2Weight and LMWeight.

* PL2Weight: Greatly improve upper bound on weight:
  + Split the weight equation into two parts and maximise each separately as
    that gives an easily solvable problem, and in common cases the maximum is
    at the same value of wdfn for both parts.  In a simple test, the upper
    bounds are now just over double the highest weight actually achieved -
    previously they were several hundred times.  This approach was suggested by
    Aarsh Shah in: xapian/xapian#48
  + Improve upper bound on normalised wdf (wdfn) - when wdf_upper_bound >
    doclength_lower_bound, we get a tighter bound by evaluating at
    wdf=wdf_upper_bound.  In a simple test, this reduces the upper bound on
    wdfn by 36-64%, and the upper bound on the weight by 9-33%.

* PL2Weight: Fix calculation of upper_bound when P2>0.  P2 is typically
  negative, but for a very common term it can be positive and then we should
  use wdfn_lower not wdfn_upper to adjust P_max.

* Weight::unserialise(): Check serialised form is empty when unserialising
  parameter-free schemes BoolWeight, DLHWeight and DPHWeight.

* TermGenerator::set_stopper_strategy(): New method to control how the Stopper
  object is used.  Patch from Arnav Jain.

* QueryParser: Fix handling of CJK query over multiple prefixes.  Previously
  all the n-gram terms were AND-ed together - now we AND together for each
  prefix, then OR the results.  Fixes #719, reported by Aaron Li.

* Add Database::get_revision() method which provides access to the database
  revision number for chert and glass, intended for use by xapiand.  Marked
  as experimental, so we don't have to go through the usual deprecation cycle
  if this proves not to be the approach we want to take.  Fixes #709,
  reported by German M. Bravo.

* Mark RangeProcessor constructor as `explicit`.

* Update to Unicode 9.0.0.

* Reimplement ESet and ESetIterator as we did for MSet and MSetIterator in
  1.3.5.  ESetIterator internally now counts down to the end of the ESet, so
  the end test is now against 0, rather than against eset.size().  And more of
  the trivial methods are now inlined, which reduces the number of relocations
  needed to load the library, and should give faster code which is a very
  similar size to before.

* MSetIterator and ESetIterator are now STL-compatible random_access_iterators
  (previously they were only bidirectional_iterators).

* TfIdfWeight: Support freq and squared IDF normalisations.  Patch from Vivek
  Pal.

* New Xapian::Query::OP_INVALID to provide an "invalid" query object.

* Reject OP_NEAR/OP_PHRASE with non-leaf subqueries early to avoid a
  potential segmentation fault if the non-leaf subquery decayed at
  just the wrong moment.  See #508.

* Reduce positional queries with a MatchAll or PostingSource subquery to
  MatchNothing (since these subqueries have no positional information, so
  the query can't match).

* Deprecate ValueRangeProcessor and introduce new RangeProcessor class as
  a replacement.  RangeProcessor()::operator()() method returns Xapian::Query,
  so a range can expand to any query.  OP_INVALID is used to signal that
  a range is not recognised.  Fixes #663.

* Combining of ranges over the same quantity with OP_OR is now handled by
  an explicit "grouping" parameter, with a sensible default which works
  for value range queries.  Boolean term prefixes and FieldProcessor now
  support "grouping" too, so ranges and other filters can now be grouped
  together.

* Formally deprecate WritableDatabase::flush().  The replacement commit()
  method was added in 1.1.0, so code can be switched to use this and still
  work with 1.2.x.

* Fix handling of a self-initialised PIMPL object (e.g. Xapian::Query q(q);).
  Previously the uninitialised pointer was copied to itself, resulting in
  undefined behaviour when the object was used to destroyed.  This isn't
  something you'd see in normal code, but it's a cheap check which can probably
  be optimised away by the compiler (GCC 6 does).

* The Snipper class has been replaced with a new MSet::snippet() method.
  The implementation has also been redone - the existing implementation was
  slower than ideal, and didn't directly consider the query so would sometimes
  selects a snippet which doesn't contain any of the query terms (which users
  quite reasonably found surprising).  The new implementation is faster, will
  always prefer snippets containing query terms, and also understands exact
  phrases and wildcards.  Fixes #211.

* Add optional reference counting support for ErrorHandler, ExpandDecider,
  KeyMaker, PostingSource, Stopper and TermGenerator.  Fixes #186, reported
  by Richard Boulton.  (ErrorHandler's reference counting isn't actually used
  anywhere in xapian-core currently, but means we can hook it up in 1.4.x if
  ticket #3 gets addressed).

* Deprecate public member variables of PostingSource.  The new getters and/or
  setters added in 1.2.23 and 1.3.5 are preferred.  Fixes #499, reported by
  Joost Cassee.

* Reimplement MSet and MSetIterator.  MSetIterator internally now counts down
  to the end of the MSet, so the end test is now against 0, rather than against
  mset.size().  And more of the trivial methods are now inlined, which reduces
  the number of relocations needed to load the library, and should give faster
  code which is a very similar size to before.

* Only issue prefetch hints for documents if MSet::fetch() is called.  It's not
  useful to send the prefetch hint right before the actual read, which was
  happening since the implementation of prefetch hints in 1.3.4.  Fixes #671,
  reported by Will Greenberg.

* Fix OP_ELITE_SET selection in multi-database case - we were selecting
  different sets for each subdatabase, but removing the special case check for
  termfreq_max == 0 solves that.

* Remove "experimental" marker from FieldProcessor, since we're happy with the
  API as-is.  Reported by David Bremner on xapian-discuss.

* Remove "experimental" marker from Database::check().  We've not had any
  negative feedback on the current API.

* Databse::check() now checks that doccount <= last_docid.

* Database::compact() on a WritableDatabase with uncommitted changes could
  produce a corrupted output.  We now throw Xapian::InvalidOperationError in
  this case, with a message suggesting you either commit() or open the database
  from disk to compact from.  Reported by Will Greenberg on #xapian-discuss

* Add Arabic stemmer.  Patch from Assem Chelli in
  xapian/xapian#45

* Improve the Arabic stopword list.  Patch from Assem Chelli.

* Make functions defined in xapian/iterator.h 'inline'.

* Don't force the user to specify the metric in the geospatial API -
  GreatCircleMetric is probably what most users will want, so a sensible
  default.

* Xapian::DBCHECK_SHOW_BITMAP: This was added in 1.3.0 (so has never been in
  a stable release) and was superseded by Xapian::DBCHECK_SHOW_FREELIST in
  1.3.2, so just remove it.

* Make setting an ErrorHandler a no-op - this feature is deprecated and we're
  not aware of anyone using it.  We're hoping to rework ErrorHandler in 1.4.x,
  which will be simpler without having to support the current behaviour as well
  as the new.  See #3.

* Update to Unicode 8.0.0.  Fixes #680.

* Overhaul database compaction API.  Add a Xapian::Database::compact() method,
  with the Database object specifying the source database(s).
  Xapian::Compactor is now just a functor to use if you want to control
  progress reporting and/or the merging of user metadata.  The existing API
  has been reimplemented using the new one, but is marked as deprecated.

* Add support for a default value when sorting.  Fixes #452, patch from
  Richard Boulton.

* Make all functor objects non-copyable.  Previously some were, some weren't,
  but it's hard to correctly make use of this ability.  Fixes #681.

* Fix use after free with WILDCARD_LIMIT_MOST_FREQUENT.  If we tried to open a
  postlist after processing such a wildcard, the postlist hint could be
  pointing to a PostList object which had been deleted.  Fixes #696, reported
  by coventry.

* Add support for optional reference counting of MatchSpy objects.

* Improve Document::get_description() - the output is now always valid UTF-8,
  doesn't contain implementation details like "Document::Internal", and more
  clearly reports if the document is linked to a database.

* Remove XAPIAN_CONST_FUNCTION marker from sortable_serialise_() helper, as it
  writes to the passed in buffer, so it isn't const or pure.  Fixes
  decvalwtsource2 testcase failure when compiled with clang.

* Make PostingSource::set_maxweight() public - it's hard to wrap for the
  bindings as a protected method.  Fixes #498, reported by Richard Boulton.

* Database:

  + Add new flag Xapian::DB_RETRY_LOCK which allows opening a database for
    writing to wait until it can get a write lock.  (Fixes #275, reported by
    Richard Boulton).

  + Fix Database::get_doclength_lower_bound() over multiple databases when some
    are empty or consist only of zero-length documents.  Previously this would
    report a lower bound of zero, now it reports the same lowest bound as a
    single database containing all the same documents.

  + Database::check(): When checking a single table, handle the ".glass"
    extension on glass database tables, and use the extension to guide the
    decision of which backend the table is from.

* Query:

  + Add new OP_WILDCARD query operator, which expands wildcards lazily, so now
    we create the PostList tree for a wildcard directly, rather than creating
    an intermediate Query tree.  OP_WILDCARD offers a choice of ways to limit
    wildcard expansion (no limit, throw an exception, use the first N by term
    name, or use the most frequent N).  (See tickets #48 and #608).

* QueryParser:

  + Add new set_max_expansion() method which provides access to OP_WILDCARD's
    choice of ways to limit expansion and can set limits for partial terms as
    well as for wildcards.  Partial terms now default to the 100 most frequent
    matching terms.  (Completes #608, reported by boomboo).

  + Deprecate set_max_wildcard_expansion() in favour of set_max_expansion().

* Add support for optional reference counting of FieldProcessor and
  ValueRangeProcessor objects.

* Update Unicode character database to Unicode 7.0.0.

* New Xapian::Snipper class from Mihai Bivol's GSOC 2012 project.  (mostly
  fixes #211)

* Fix all get_description() methods to always return UTF-8 text.  (fixes #620)

* Database::check():

  + Alter to take its "out" parameter as a pointer to std::ostream instead of a
    reference, and make passing NULL mean "do not produce output", and make
    the second and third parameters optional, defaulting to a quiet check.

  + Escape invalid UTF-8 data in keys and tags reported by xapian-check, using
    the same code we use to clean up strings returned by get_description()
    methods.

  + Correct failure message which talks above the root block when it's actually
    testing a leaf key.

  + Rename DBCHECK_SHOW_BITMAP to DBCHECK_SHOW_FREELIST (old name still
    provided for now, but flagged as deprecated - DBCHECK_SHOW_BITMAP was new
    in 1.3.0, so will likely be removed before 1.4.0).

* Methods and functions which take a string to unserialise now consistently
  call that parameter "serialised".

* Weight: Make number of distinct terms indexing each document and the
  collection frequency of the term available to subclasses.  Patch from
  Gaurav Arora's Language Modelling branch.

* WritableDatabase: Add support for multiple subdatabases, and support opening
  a stub database containing multiple subdatabases as a WritableDatabase.

* WritableDatabase can now be constructed from just a pathname (defaulting to
  opening the database with DB_CREATE_OR_OPEN).

* WritableDatabase: Add flags which can be bitwise OR-ed into the second
  argument when constructing:

  + Xapian::DB_NO_SYNC: to disable use of fsync, etc

  + Xapian::DB_DANGEROUS: to enable in-place updates

  + Xapian::DB_BACKEND_CHERT: if creating, create a chert database

  + Xapian::DB_BACKEND_GLASS: if creating, create a glass database

  + Xapian::DB_NO_TERMLIST: create a database without a termlist (see #181)

  + Xapian::DB_FULL_SYNC flag - if this is set for a database, we use the Mac
    OS X F_FULL_SYNC instead of fdatasync()/fsync()/etc on the version file
    when committing.

* Database: Add optional flags argument to constructor - the following can be
  bitwise OR-ed into it:

  + Xapian::DB_BACKEND_CHERT (only open a chert database)

  + Xapian::DB_BACKEND_GLASS (only open a glass database)

  + Xapian::DB_BACKEND_STUB (only open a stub database)

* Xapian::Auto::open_stub() and Xapian::Chert::open() are now deprecated in
  favour of these new flags.

* Add LMWeight class, which implements the Unigram Language Modelling weighting
  scheme.  Patch from Gaurav Arora.

* Add implementations of a number of DfR weighting schemes (BB2, DLH, DPH,
  IfB2, IneB2, InL2, PL2).  Patches from Aarsh Shah.

* Add support for the Bo1 query expansion scheme.  Patch from Aarsh Shah.

* Add Enquire::set_time_limit() method which sets a timelimit after which
  check_at_least will be disabled.

* Database: Trying to perform operations on a database with no subdatabases now
  throws InvalidOperationError not DocNotFoundError.

* Query: Implement new OP_MAX query operator, which returns the maximum weight
  of any of its subqueries.  (see #360)

* Query: Add methods to allow introspection on Query objects - currently you
  can read the leaf type/operator, how many subqueries there are, and get a
  particular subquery.  For a query which is a term, Query::get_terms_begin()
  allows you to get the term.  (see #159)

* Query: Only simplify OP_SYNONYM with a single subquery if that subquery is a
  term or MatchAll.

* Avoid two vector copies when storing term positions in most common cases.

* Reimplement version functions to use a single function in libxapian which
  returns a pointer to a static const struct containing the version
  information, with inline wrappers in the API header which call this.  This
  means we only need one relocation instead of 4, reducing library load time a
  little.

* Make TermGenerator flags an anonymous enum, and typedef TermGenerator::flags
  to int for backward compatibility with existing user code which uses it.

* Stem: Fix incorrect Unicode codepoints for o-double-acute and u-double-acute
  in the Hungarian Snowball stemmer.  Reported by Tom Lane to snowball-discuss.

* Stem: Add an early english stemmer.

* Provide the stopword lists from Snowball plus an Arabic one, installed in
  ${prefix}/share/xapian-core/stopwords/.  Patch from Assem Chelli, fixes #269.

* Improve check for direct inclusion of Xapian subheaders in user code to
  catch more cases.

* Add simple API to help with creating language-idiomatic iterator wrappers
  in <xapian/iterator.h>.

* Give an compilation error if user code tries to include API headers other
  than xapian.h directly - these other headers are an internal implementation
  detail, but experience has shown that some people try to include them
  directly.  Please just use '#include <xapian.h>' instead.

* Update Unicode character database to Unicode 6.2.0.

* Add FieldProcessor class (ticket#128) - currently marked as an experimental
  API while we sort out how best to sort out exactly how it interacts with
  other QueryParser features.

* Add implementation of several TF-IDF weighting schemes via a new TfIdfWeight
  class.

* Add ExpandDeciderFilterPrefix class which only return terms with a particular
  prefix.  (fixes #467)

* QueryParser: Adjust handling of Unicode opening/closing double quotes - if a
  quoted boolean term was started with ASCII double quote, then only ASCII
  double quote can end it, as otherwise it's impossible to quote a term
  containing Unicode double quotes.

* Database::check(): If the database can't be opened, don't emit a bogus
  warning about there being too many documents to cross-check doclens.

* TradWeight,BM25Weight: Throw SerialisationError instead of NetworkError if
  unserialise() fails.

* QueryParser: Change the default stemming strategy to STEM_SOME, to eliminate
  the API gotcha that setting a stemmer is ignored until you also set a
  strategy.

* Deprecate Xapian::ErrorHandler.  (ticket#3)

* Stem: Generate a compact and efficient table to decode language names.  This
  is both faster and smaller than the approach we were using, with the added
  benefit that the table is auto-generated.

* xapian.h:

  + Add check for Qt headers being included before us and defining
    'slots' as a macro - if they are, give a clear error advising how to work
    around this (previously compilation would fail with a confusing error).

  + Add a similar check for Wt headers which also define 'slots' as a macro
    by default.

* Update Unicode character database to Unicode 6.1.0.  (ticket#497)

* TermIterator returned by Enquire::get_matching_terms_begin(),
  Query::get_terms_begin(), Database::synonyms_begin(),
  QueryParser::stoplist_begin(), and QueryParser::unstem_begin() now stores the
  list of terms to iterate much more compactly.

* QueryParser:

  + Allow Unicode curly double quote characters to start and/or end phrases.

  + The set_default_op() method will now reject operators which don't make
    sense to set.  The operators which are allowed are now explicitly
    documented in the API docs.

* Query: The internals have been completely reimplemented (ticket#280).  The
  notable changes are:

  + Query objects are smaller and should be faster.

  + More readable format for Query::get_description().

  + More compact serialisation format for Query objects.

  + Query operators are no longer flattened as you build up a tree (but the
    query optimiser still combines groups of the same operator).  This means
    that Query objects are truly immutable, and so we don't need to copy Query
    objects when composing them.  This should also fix a few O(n*n) cases when
    building up an n-way query pair-wise.  (ticket#273)

  + The Query optimiser can do a few extra optimisations.

* There's now explicit support for geospatial search (this API is currently
  marked as experimental).  (ticket#481)

* There's now an API (currently experimental) for checking the integrity of
  databases (partly addresses ticket#238).

* Database::reopen() now returns true if the database may have been reopened
  (previously it returned void).  (ticket#548)

* Deprecate Xapian::timeout in favour of POSIX type useconds_t.

* Deprecate Xapian::percent and use int instead in the API and our own code.

* Deprecate Xapian::weight typedef in favour of just using double and change
  all uses in the API and our own code.  (ticket#560)

* Rearrange members of Xapian::Error to reduce its size (from 48 to 40 bytes on
  x86-64 Linux).

* Assignment operators for PositionIterator and TermIterator now return *this
  rather than void.

* PositionIterator, PostingIterator, TermIterator and ValueIterator now
  handle their reference counts in hand-crafted code rather than using
  intrusive_ptr/RefCntPtr, which means the compiler can inline the destructor
  and default constructor, so a comparison to an end iterator should now
  optimise to a simple NULL pointer check, but without the issues which the
  ValueIteratorEnd_ proxy class approach had (such as not working in templates
  or some cases of overload resolution).

* Enquire:

  + Previously, Enquire::get_matching_terms_begin() threw InvalidArgumentError
    if the query was empty.  Now we just return an end iterator, which is more
    consistent with how empty queries behave elsewhere.

  + Remove the deprecated old-style match spy approach of using a MatchDecider.

* Remove deprecated Sorter class and MultiValueSorter subclass.

* Xapian::Stem:

  + Add stemmers for Armenian (hy), Basque (eu), and Catalan (ca).

  + Stem::operator= now returns a reference to the assigned-to object.


testsuite:

* OP_SCALE_WEIGHT: Check top weight is non-zero - if it is zero, tests which
  try to check that OP_SCALE_WEIGHT works will always pass.

* testsuite: Check SerialisationError descriptions from Xapian::Weight
  subclasses mention the weighting scheme name.

* Merge queryparsertest and termgentest into apitest.  Their testcases now use
  the backend manager machinery in the testharness, so we don't have to
  hard-code use of inmemory and chert backends, but instead run them under all
  backends which support the required features.  This fixes some test failures
  when both chert and glass are disabled due to trying to run spelling tests
  with the inmemory backend.

* Avoid overflowing collection frequency in totaldoclen1.  We're trying to test
  total document length doesn't wrap, so avoid collection freq overflowing in
  the process, as that triggers errors when running the testsuite under ubsan.
  We should handle collection frequency overflow better, but that's a separate
  issue.

* Add some test coverage for ESet::get_ebound().

* Fix testcase notermlist1 to check correct table extension - ".glass" not
  ".DB" (chert doesn't support DB_NO_TERMLIST).

* unittest: We can't use Assert() to unit test noexcept code as it throws an
  exception if it fails.  Instead set up macros to set a variable and return if
  an assertion fails in a unittest testcase, and check that variable in the
  harness.

* Add unit test for internal C_isupper(), etc functions.

* If command line option --verbose/-v isn't specified, set the verbosity level
  from environmental variable VERBOSE.

* Re-enable replicate3 for glass, as it no longer fails.

* Add more test coverage for get_unique_terms().

* Don't leave an extra fd open when starting xapian-tcpsrv for remotetcp tests.

* Extend checkstatsweight1 to check that Weight::get_collection_freq() returns
  the same number as Database::get_collection_freq().

* queryparsertest: Add testcase for FieldProcessor on boolean prefix with
  quoted contents.

* queryparsertest: Enable some disabled cases which actually work (in some
  cases with slightly tweaked expected answers which are equivalent to those
  that were shown).

* Make use of the new writable multidatabase feature to simplify the
  multi-database handling in the test harness.

* Change querypairwise1_helper to repeat the query build 100 times, as with a
  fast modern machine we were sometimes trying with so many subqueries that we
  would run out of stack.

* apitest: Use Xapian::Database::check() in cursordelbug1.  (partly addresses
  #238)

* apitest: Test Query ops with a single MatchAll subquery.

* apitest: New testcase readonlyparentdir1 to ensure that commit works with a
  read-only parent directory.

* tests/generate-api_generated: Test that the string returned by a
  get_description() method isn't empty.

* Use git commit hash in title of test coverage reports generated from a git
  tree.

* Make unittest use the test harness, so it gets all the valgrind and fd leak
  checks, and other handy features all the other tests have.

* Improve test coverage in several places.

* Compress generated HTML files in coverage report.


matcher:

* Fix stats passed to Weight with OP_SYNONYM.  Previously the number of
  unique terms was never calculated, and a term which matched all documents
  would be optimised to an all-docs postlist, which fails to supply the
  correct wdf info.

* Use floating point calculation for OR synonym freq estimates.  The division
  was being done as an integer division, which means the result was always
  getting rounded down rather than rounded to the nearest integer.

* Fix upper bound on matches for OP_XOR.  Due to a reversed conditional, the
  estimate could be one too low in some cases where the XOR matched all the
  documents in the database.

* Improve lower bound on matches for OP_XOR.  Previously the lower bound was
  always set to 0, which is valid, but we can often do better.

* Optimise value range which is a superset of the bounds.  If the value
  frequency is equal to the doccount, such a range is equivalent to MatchAll,
  and we now avoid having to read the valuestream at all.

* Optimise OP_VALUE_RANGE when the upper bound can't be exceeded.  In this
  case, we now use ValueGePostList instead of ValueRangePostList.

* Streamline collation of statistics for use by weighting schemes - tests show
  a 2% or so increase in speed in some cases.

* If a term matches all documents and its weight doesn't depend on its wdf, we
  can optimise it to MatchAll (the previous requirement that maxpart == 0 was
  unnecessarily strict).

* Fix the check for a term which matches all documents to use the sub-db
  termfreq, not the combined db termfreq.

* When we optimise a postlist for a term which matches all documents to use
  MatchAll, we still need to set a weight object on it to get percentages
  calculated correctly.

* Drop MatchNothing subqueries in OR-like situations in add_subquery() rather
  than adding them and then handling it later.

* Handle the left side of AND_NOT and AND_MAYBE being MatchNothing in
  add_subquery() rather than in done().

* Handle QueryAndLike with a MatchNothing subquery in add_subquery() rather
  than done().

* Query: Multi-way operators now store their subquery pointers in a custom
  class rather than std::vector<Xapian::Query>.  The custom class take the
  same amount of space, or often less.  It's particularly efficient when
  there are two subqueries, which is very desirable as we no longer flatten a
  subtree of the same operator as we build the query.

* Optimise an unweighted query term which matches all the documents in a
  subdatabase to use the "MatchAll" postlist.  (ticket#387)


glass backend:

* Fix allterms with prefix on glass with uncommitted changes.  Glass aims to
  flush just the relevant postlist changes in this case but the end of the
  range to flush was wrong, so we'd only actually flush changes for a term
  exactly matching the prefix.  Fixes #721.

* Fix Database::check() parsing of glass changes file header.  In practice this
  was unlikely to actually cause problems.

* Make glass the default backend.  The format should now be stable, except
  perhaps in the unlikely event that a bug emerges which requires a format
  change to address.

* Don't explicitly store the 2 byte "component_of" counter for the first
  component of every Btree entry in leaf blocks - instead use one of the upper
  bits of the length to store a "first component" flag.  This directly saves 2
  bytes per entry in the Btree, plus additional space due to fewer blocks and
  fewer levels being needed as a result.  This particularly helps the position
  table, which has a lot of entries, many of them very small.  The saving would
  be expected to be a little less than the saving from the change which shaved
  2 bytes of every Btree item in 1.3.4 (since that saved 2 bytes multiple times
  for large entries which get split into multiple items).  A simple test
  suggests a saving of several percent in total DB size, which fits that.  This
  change reduces the maximum component size to 8194, which affects tables
  with a 64KB blocksize in normal use and tables with >= 16KB blocksize with
  full compaction.

* Refactor glass backend key comparison - == and < operations are replaced by
  a compare() function returns negative, 0 or positive (like strcmp(), memcmp()
  and std::string::compare()).  This allows us to avoid a final compare to
  check for equality when binary chopping, and to terminate early if the binary
  chop hits the exact entry.

* If a cursor is moved to an entry which doesn't exist, we need to step back to
  the first component of previous entry before we can read its tag.  However we
  often don't actually read its tag (e.g. if we only wanted the key), so make
  this stepping back lazy so we can avoid doing it when we don't want to read
  the tag.

* Avoid creating std::string objects to hold data when compressing and
  decompressing tags with zlib.

* Store minimum compression length per table in the version file, with 0
  meaning "don't compress".  Currently you can only change this setting with a
  hex editor on the file, but now it is there we can later make use of it
  without needing a database format change.

* Database::check() now performs additional consistency checks for glass.
  Reported by Jean-Francois Dockes and Bob Cargill via xapian-discuss.

* Database::check(): check docids don't exceed db_last_docid when checking
  a single glass table.

* We now throw DatabaseCorruptError in a few cases where it's appropriate
  but we didn't previously, in particular in the case where all the files in a
  DB have been truncated to zero size (which makes handling of this case
  consistent with chert).

* Fix compaction to a single file which already exists.  This was hanging.
  Noted by Will Greenberg on #xapian.

* Shave 2 bytes of every Btree item (which will probably typically reduce
  database size by several percent).

* More compact item format for branch blocks - 2 bytes per item smaller.  This
  means each branch block can branch more ways, reducing the number of Btree
  levels needed, which is especially helpful for cold-cache search times.

* Track an upper bound on spelling word frequency.  This isn't currently used,
  but will be useful for improving the spelling algorithm, and we want to
  stabilise the glass backend format.  See #225, reported by Philip Neustrom.

* Support 64-bit docids in the glass backend on-disk format.  This changes the
  encoding used by pack_uint_preserving_sort() to one which supports 64 bit
  values, and is a byte smaller for values 16384-32767, and the same size for
  all other 32 bit values.  Fixes #686, from original report by James Aylett.

* Use memcpy() not memmove() when no risk of overlap.

* Store length of just the key data itself, allowing keys to be up to 255 bytes
  long - the previous limit was 252.

* Change glass to store DB stats in the version file.  Previously we stored
  them in a special item in the postlist table, but putting them in the version
  file reduces the number of block reads required to open the database, is
  simpler to deal with, and means we can potentially recalculate tight upper
  and lower bounds for an existing database without having to commit a new
  revision.

* Add support for a single-file variant for glass.  Currently such databases
  can only be opened for reading - to create one you need to use
  xapian-compact (or its API equivalent).  You can embed such databases within
  another file, and open them by passing in a file descriptor open on that file
  and positioned at the offset the database starts at).  Database::check() also
  supports them.  Fixes #666, reported by Will Greenberg (and previously
  suggested on xapian-discuss by Emmanuel Engelhart).

* Avoid potential DB corruption with full-compaction when using 64K blocks.

* Where posix_fadvise() is available, use it to prefetch postlist Btree blocks
  from the level below the root block which will be needed for postlists of
  terms in the query, and similarly for the docdata table when MSet::fetch() is
  called.  Based on patch by Will Greenberg in #671.

* When reporting freelist errors during a database check, distinguish between a
  block in use and in the freelist, and a block in the freelist more than once.

* Fix compaction and database checking for the change to the format of keys
  in the positionlist table which happened in 1.3.2.

* After splitting a block, we always insert the new block in the parent right
  after the block it was split from - there's no need to binary chop.

* Avoid infinite recursion when we hit the end of the freelist block we're
  reading and the end of the block we're writing at the same time.

* Fix freelist handling to allow for the newly loaded first block of the
  freelist being already used up.

* 'brass' backend renamed to 'glass' - we decided to use names in ascending
  alphabetical order to make it easier to understand which backend is newest,
  and since 'flint' was used recently, we skipped over 'd', 'e' and 'f'.

* Change positionlist keys to be ordered by term first rather than docid first,
  which helps phrase searching significantly.  For more efficient indexing,
  positionlist changes are now batched up in memory and written out in key
  order.

* Use a separate cursor for each position list - now we're ordering the
  position B-tree by term first, phrase matching would cause a single cursor
  to cycle between disparate areas of the B-tree and reread the same blocks
  repeatedly.

* Reference count blocks in the btree cursor, so cursors can cheaply share
  blocks.  This can significantly reduce the amount of memory used by cursors
  for queries which contain a lot of terms (e.g. wildcards which expand to a
  lot of terms).

* Under glass, optimise the turning of a query into a postlist to reuse the
  cursor blocks which are the same as the previous term's postlist.  This is
  particularly effective for a wildcard query which expands to a lot of terms.

* Keep track of unused blocks in the Btrees using freelists rather than
  bitmaps.  (fixes #40)

* Eliminate the base files, and instead store the root block and freelist
  pointers in the "iamglass" file.

* When compacting, sync all the tables together at the end.

* In DB_DANGEROUS mode, update the version file in-place.

* Only actually store the document data if it is non-empty.  The table which
  holds the document data is now lazily created, so won't exist if you never
  set the document data.

* Iterating positional data now decodes it lazily, which should speed up
  phrases which include common words.

* Compress changesets in brass replication. Increments the changeset version.
  Ticket #348

* Restore two missing lines in database checking where we report a block with
  the wrong level.

* When checking if a block was newly allocated in this revision, just look
  at its revision number rather than consulting the base file's bitmap.


remote backend:

* Improve handling of invalid remote stub entries: Entries without a colon now
  give an error rather than being quietly skipped; IPv6 isn't yet supported,
  but entries with IPv6 addresses now result in saner errors (previously the
  colons confused the code which looks for a port number).

* Fix hook for remote support of user weighting schemes.  The commented-out
  code used entirely the wrong class - now we use the server object we have
  access to, and forward the method to the class which needs it.

* Avoid dividing zero by zero when calculating the average length for an empty
  database.

* Bump remote protocol version to 38.0, due to extra statistics being tracked
  for weighting.

* Make Weight::Internal track if any max_part values are set, so we don't need
  to serialise them when they've not been set.

* Prefix compress list of terms and metadata keys in the remote protocol.
  This requires a remote protocol major version bump.

* When propagating exceptions from a remote backend server, the protocol now
  sends a numeric code to represent which exception is being propagated, rather
  than the name of the type, as a number can be turned back into an exception
  with a simple switch statement and is also less data to transfer.
  (ticket#471)

* Remote protocol (these changes require a protocol major version bump):

  + Unify REPLY_GREETING and REPLY_UPDATE.

  + Send (last_docid - doccount) instead of last_docid and (doclen_ubound -
    doclen_lbound) instead of doclen_ubound.

* Remove special check which gives a more helpful error message when a modern
  client is used against a remote server running Xapian <= 0.9.6.


chert backend:

* When using 64-bit Xapian::docid, consistently use the actual maximum valid
  docid value rather instead of the maximum value the type can hold.

* Where posix_fadvise() is available, use it to prefetch postlist Btree blocks
  from the level below the root block which will be needed for postlists of
  terms in the query, and similarly for the record table when MSet::fetch() is
  called.  Based on patch by Will Greenberg in #671.

* Fix problems with get_unique_terms() on a modified chert database.

* Fix xapian-check on a single chert table, which seg faulted in 1.3.2.

* Improve DBCHECK_FIX:

  + if fixing a whole database, we now take the revision from the first table
    we successfully look at, which should be correct in most cases, and is
    definitely better than trying to determine the revision of each broken
    table independently.

  + handle a zero-sized .DB file.

  + After we successfully regenerate baseA, remove any empty baseB file to
    prevent it causing problems.  Tracked down with help from Phil Hands.

* Iterating positional data now decodes it lazily, which should speed up
  phrases which include common words.


flint backend:

* Remove flint backend.
@ivmarkp ivmarkp deleted the bm25+ branch December 21, 2016 13:37
richhiey pushed a commit to richhiey/xapian that referenced this pull request Mar 17, 2017
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
4 participants