Recommend scoring hits with BM25(k1=0.9,b=0.4). #46

Open
wants to merge 2 commits into
base: master
Conversation


@jpountz jpountz commented Sep 25, 2023

Currently, different engines use different parameters for BM25, e.g. Tantivy and Lucene use (k1=1.2,b=0.75) while PISA uses (k1=0.9,b=0.4). Robertson et al. had initially suggested that 1.2/0.75 would make good defaults for BM25, but Trotman et al. later suggested that 0.9/0.4 would make better defaults, and this seems to be the consensus nowadays.
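
For reference, here is a minimal sketch of how k1 and b enter the per-term BM25 score. It is not the code of any of the engines mentioned: the function name `bm25_term_score`, the IDF variant `ln(1 + (N - df + 0.5) / (df + 0.5))`, and the collection statistics in the loop are assumptions chosen for illustration.

```python
import math

def bm25_term_score(tf, doc_len, avg_doc_len, doc_freq, num_docs, k1, b):
    """Contribution of one query term to one document's BM25 score."""
    # Assumed IDF variant; it is non-negative for any doc_freq <= num_docs.
    idf = math.log(1.0 + (num_docs - doc_freq + 0.5) / (doc_freq + 0.5))
    # b controls how strongly documents longer than average are penalized.
    length_norm = 1.0 - b + b * doc_len / avg_doc_len
    # k1 controls how quickly the score saturates as tf grows.
    return idf * tf * (k1 + 1.0) / (tf + k1 * length_norm)

# Hypothetical statistics: 1000 documents, the term occurs in 10 of them.
for k1, b in [(1.2, 0.75), (0.9, 0.4)]:
    avg = bm25_term_score(1, 100, 100, 10, 1000, k1, b)
    long_doc = bm25_term_score(1, 200, 100, 10, 1000, k1, b)
    print(f"k1={k1} b={b}: average-length doc {avg:.3f}, double-length doc {long_doc:.3f}")
```

With tf=1 and an average-length document, both settings yield exactly the IDF. For a document twice the average length, the score drops to 2.2/3.1 (about 0.71) of the IDF with 1.2/0.75, but only to 1.9/2.26 (about 0.84) with 0.9/0.4, so the two settings can order the same hits differently.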

The ranking function matters because it affects which hits may be skipped via dynamic pruning, which in turn affects search performance.
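
To illustrate the link between scoring and pruning, here is a rough sketch in the spirit of MaxScore/WAND-style pruning. It is a simplification, not what Lucene, Tantivy or PISA actually implement: the helper names (`bm25`, `max_term_score`, `can_skip`), the postings layout and the skip rule are all assumptions.

```python
def bm25(tf, doc_len, avg_doc_len, idf, k1, b):
    """Per-term BM25 score, with the IDF passed in precomputed."""
    return idf * tf * (k1 + 1.0) / (tf + k1 * (1.0 - b + b * doc_len / avg_doc_len))

def max_term_score(postings, doc_lens, avg_doc_len, idf, k1, b):
    """Largest score this term contributes to any document in its postings.

    postings is a list of (doc_id, tf) pairs; doc_lens maps doc_id to length.
    """
    return max(bm25(tf, doc_lens[doc], avg_doc_len, idf, k1, b) for doc, tf in postings)

def can_skip(term_upper_bounds, threshold):
    """True if a hit matching only these terms cannot beat the current k-th best score."""
    return sum(term_upper_bounds) <= threshold

# Hypothetical postings for one term: (doc_id, tf).
postings = [(0, 1), (1, 3)]
doc_lens = {0: 100, 1: 100}
for k1, b in [(1.2, 0.75), (0.9, 0.4)]:
    bound = max_term_score(postings, doc_lens, 100, 2.0, k1, b)
    print(f"k1={k1} b={b}: upper bound {bound:.3f}")
```

Because the upper bounds are computed with the same k1 and b as the scores, changing the parameters changes both the bounds and the top-k threshold, and therefore which hits the engine is able to skip.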

Closes #45

@jpountz
Author

jpountz commented Sep 25, 2023

I believe that PISA does not require changes, though it would be nice to make the BM25 configuration more explicit in the query logic. What do you think, @amallia? I could use some help making that change as I'm not too familiar with the PISA API.

It looks like Tantivy supports configuring the ranking function, but I'm not proficient in Rust and could use some help there too.

Collaborator

@fulmicoton fulmicoton left a comment


The PR does not make the change for Tantivy, does it?


jpountz commented Sep 25, 2023

Indeed, it does not. I would like to change it, but I am not familiar with Rust and am unsure how to do it. I could use some help.


jpountz commented Oct 1, 2023

As a counterpoint, @rmuir pointed me to the DFR paper which shows that BM25 with k1=1.2/b=0.75 happens to closely match the parameter-free I(n)L2 model, giving further evidence that k1=1.2/b=0.75 are good defaults for BM25.


jpountz commented Oct 1, 2023

Separately, I checked more search engines and IR toolkits:

  • PISA, Anserini, JASS, ATIRE use 0.9/0.4
  • Terrier, Vespa, Lucene, Tantivy use 1.2/0.75

So there doesn't actually seem to be a consensus. The point from the DFR paper that theory meets practice with 1.2/0.75 is quite convincing. Unless I find more evidence that 0.9/0.4 is more effective, I am considering switching PISA to 1.2/0.75 instead of switching Lucene and Tantivy to 0.9/0.4.

Development

Successfully merging this pull request may close these issues.

Use same BM25 k1/b parameters across engines.
2 participants