Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Make BM25 scoring more flexible #1855

Merged
merged 4 commits into from
Feb 16, 2023

Conversation

alexcole
Copy link
Contributor

@alexcole alexcole commented Feb 9, 2023

This diff makes a few changes to expand the ways developers can use tantivy's BM25 scoring:

  1. Make Bm25Weight public.
  2. Remove the dependency of Bm25Weight on a Searcher. Instead it expects a Bm25StatisticsProvider (which is by default a Searcher).
  3. Allow specifying a custom Bm25StatisticsProvider when running a query.

Some example use cases this unlocks:

  1. Using Bm25Weight without the rest of tantivy.
    • Having a correct BM25 implementation is tricky! The Bm25Weight is also optimized to cache based on fieldnorm which is nice.
    • Now it has no dependency on actually storing documents in tantivy to use.
  2. Adjusting the statistics to include updates not yet saved to tantivy.
    • If a developer is using tantivy as a part of a larger system, they may not store every document to tantivy immediately when the data is created. For example they may store recent documents in memory and only periodically sync them to tantivy. Or there may be documents that are created in a transaction that hasn't committed yet (so they shouldn't be saved to tantivy yet).
    • In these cases, the Bm25StatisticsProvider should be the Searcher with adjustments for the data not yet saved to tantivy.
  3. Adjust the statistics to account for deletes
    • Currently the numbers of documents and tokens include deleted ones which is somewhat surprising. Being able to customize the Bm25StatisticsProvider opens the door to storing additional files to correctly track the number of undeleted documents and tokens.

My company (Convex) is using (1) and (2) and maybe some day we'll want to do (3) as well.

Let me know if any of this doesn't make sense! Also happy to change the implementation if there is a better way to accomplish these goals.

@fulmicoton
Copy link
Collaborator

@alexcole thanks for the contribution... espeically taking time to properly explain what your use case is.
@shikhar can you review?

@fulmicoton fulmicoton merged commit f2f38c4 into quickwit-oss:main Feb 16, 2023
Copy link
Collaborator

@shikhar shikhar left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Good stuff, LGTM as well! Sorry about the delay.

adrianpasternak pushed a commit to adrianpasternak/tantivy that referenced this pull request May 15, 2023
* Introduce Bm25StatisticsProvider to inject statistics

* fix formatting I accidentally changed

(cherry picked from commit f2f38c4)
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

None yet

4 participants