
Use same BM25 k1/b parameters across engines. #45

Open
jpountz opened this issue Sep 24, 2023 · 1 comment · May be fixed by #46

Comments


jpountz commented Sep 24, 2023

The k1 and b parameters of BM25 influence which hits may be dynamically pruned, and thus the performance numbers, so it would be good to use the same values across engines. Currently each engine appears to use its own defaults: k1=0.9 and b=0.4 for PISA, and k1=1.2 and b=0.75 for Lucene and Tantivy.
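As a point of reference, here is a minimal Python sketch of a Lucene-style BM25 term score (the exact idf variant differs slightly between engines), showing where k1 and b enter:

```python
import math

def bm25_term_score(tf, doc_len, avg_doc_len, df, num_docs, k1=0.9, b=0.4):
    """Score one (term, document) pair with BM25.

    k1 controls term-frequency saturation; b controls document-length
    normalization. The 0.9/0.4 defaults follow Trotman et al.; Lucene
    and Tantivy default to 1.2/0.75.
    """
    idf = math.log(1 + (num_docs - df + 0.5) / (df + 0.5))
    norm = k1 * (1 - b + b * doc_len / avg_doc_len)
    return idf * tf * (k1 + 1) / (tf + norm)
```

Note that when doc_len equals avg_doc_len, the length-normalization term collapses to k1 regardless of b, so b only matters for documents that are shorter or longer than average.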

jpountz added a commit to jpountz/search-benchmark-game that referenced this issue Sep 25, 2023
Currently different engines use different parameters for BM25, e.g. Tantivy and
Lucene use (k1=1.2,b=0.75) while PISA uses (k1=0.9,b=0.4). Robertson et al. had
initially suggested that 1.2/0.75 would make good defaults for BM25 but Trotman
et al. later suggested that 0.9/0.4 would make better defaults and this seems
to be the consensus nowadays.

The ranking function matters because it affects which hits may be skipped via
dynamic pruning, which in turn affects search performance.

Closes quickwit-oss#45
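To make the pruning connection concrete, here is a minimal Python sketch of the MaxScore idea (a sketch of the general technique, not any engine's actual implementation): per-term score upper bounds, which grow with k1 via BM25's term-frequency saturation, determine which posting lists can be skipped once a top-k threshold is known.

```python
def term_upper_bound(idf, k1):
    # BM25's tf component tf*(k1+1)/(tf + norm) is strictly below k1+1,
    # so idf*(k1+1) is a (loose) upper bound on any doc's score for the term.
    return idf * (k1 + 1)

def essential_terms(term_bounds, threshold):
    """Split query terms into essential / non-essential, MaxScore-style.

    Non-essential terms form a maximal low-bound suffix whose bounds sum
    to at most the current top-k threshold: a document appearing only in
    non-essential terms' postings cannot enter the top-k and is skipped.
    Returns the number of essential terms after sorting bounds descending.
    """
    bounds = sorted(term_bounds, reverse=True)
    suffix_sum = 0.0
    essential = len(bounds)
    for i in range(len(bounds) - 1, -1, -1):
        if suffix_sum + bounds[i] <= threshold:
            suffix_sum += bounds[i]
            essential = i
        else:
            break
    return essential
```

Since the per-term bounds depend on k1 and b, changing these parameters shifts the essential/non-essential split, which is why the choice affects how much work dynamic pruning can avoid.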
@jpountz jpountz linked a pull request Sep 25, 2023 that will close this issue

jpountz commented Oct 2, 2023

To get a sense of the influence of these parameters on query performance, I compared Lucene-9.8 with 1.2/0.75 against 0.9/0.4 on the TOP_100 command. I'm getting:

  • 4.6% better latency on average for intersections with 0.9/0.4
  • 4.2% better latency on average for unions with 0.9/0.4

So the improvement is not huge, but it is significant and extremely consistent:

  • 7 queries get better latencies with 1.2/0.75
  • 2 queries get the same latencies
  • 893 queries get a better latency with 0.9/0.4
