Tantivy NDCG Benchmarking Information Retrieval (BEIR) #2455

Closed
triandco opened this issue Jul 19, 2024 · 3 comments
triandco commented Jul 19, 2024

I created a repo to evaluate Tantivy retrieval using measurements such as NDCG, MAP, and recall, following the BEIR methodology. In the project, I use tantivy to index and retrieve documents from multiple datasets. The retrieval results are saved to a TSV file and then loaded into Python for scoring with pytrec_eval (which is what BEIR is built on).

Currently, my results are suspiciously low compared to the BM25-flat baseline published on the BEIR leaderboard.

| Dataset | Tantivy ndcg@10 | BEIR BM25-flat ndcg@10 |
| --- | --- | --- |
| Scifact | 0.6251573122952132 | 0.679 |
| NFCorpus | 0.20505084876906404 | 0.322 |
| TREC-COVID | 0.0362915780899568 | 0.595 |

I was following the tantivy example to index and search; I'm not sure whether this is the best approach.
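Roughly, the flow follows tantivy's basic search example: build a schema, index each corpus document, then run each query through a QueryParser and keep the top 10 BM25 hits. A simplified sketch (the field names, memory budget, and the sample document/query below are illustrative, not my exact code):

```rust
use tantivy::collector::TopDocs;
use tantivy::query::QueryParser;
use tantivy::schema::{Schema, STORED, STRING, TEXT};
use tantivy::{doc, Index};

fn main() -> tantivy::Result<()> {
    // Schema: a raw stored doc id (needed to write the run file) and a tokenized text field.
    let mut schema_builder = Schema::builder();
    let doc_id = schema_builder.add_text_field("doc_id", STRING | STORED);
    let text = schema_builder.add_text_field("text", TEXT);
    let schema = schema_builder.build();

    // Index the corpus (one add_document call per BEIR document).
    let index = Index::create_in_ram(schema);
    let mut writer = index.writer(50_000_000)?;
    writer.add_document(doc!(doc_id => "doc1", text => "the incubation period of covid-19"))?;
    writer.commit()?;

    // For each BEIR query: parse it, take the top 10 BM25 hits, and record (doc, score).
    let reader = index.reader()?;
    let searcher = reader.searcher();
    let query_parser = QueryParser::for_index(&index, vec![text]);
    let query = query_parser.parse_query("covid incubation period")?;
    let top_docs = searcher.search(&query, &TopDocs::with_limit(10))?;
    for (score, doc_address) in top_docs {
        println!("{doc_address:?}\t{score}");
    }
    Ok(())
}
```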

If you could have a look and let me know whether there's anything wonky with my retrieval implementation, that would be very much appreciated.

fulmicoton (Collaborator) commented Jul 21, 2024

I don't have time to debug this thing. One thing you can do is pick one specific example where tantivy is outperformed by BM25, and use "explain".

The usual suspects are

  • your evaluation code
  • tokenization
  • the way queries are sanitized
  • BM25 constants

Tantivy has an explain function that tells you precisely how tantivy came up with a given score.
(The formula should be the same as Lucene's.)

https://docs.rs/tantivy/latest/tantivy/query/trait.Query.html#method.explain
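Something along these lines (an untested sketch; the function name and query string are made up):

```rust
use tantivy::collector::TopDocs;
use tantivy::query::{Query, QueryParser};
use tantivy::schema::Field;
use tantivy::{Index, Searcher};

// Print the score breakdown of every top-10 hit for a single query.
fn explain_top_hits(index: &Index, searcher: &Searcher, text: Field, query_str: &str) -> tantivy::Result<()> {
    let query_parser = QueryParser::for_index(index, vec![text]);
    let query = query_parser.parse_query(query_str)?;
    for (_score, doc_address) in searcher.search(&query, &TopDocs::with_limit(10))? {
        // The explanation breaks the BM25 score into its idf / tf / fieldnorm parts,
        // which can be compared against Lucene's explain output for the same document.
        let explanation = query.explain(searcher, doc_address)?;
        println!("{}", explanation.to_pretty_json());
    }
    Ok(())
}
```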

triandco changed the title from "Tantivy NDCG Retrieval Evaluation (BEIR)" to "Tantivy NDCG Benchmarking Information Retrieval (BEIR)" on Jul 22, 2024
triandco (Author)

Thank you @fulmicoton, I've gone through each of the usual suspects and verified them. I also ran the task against Lucene, which yielded scores very close to tantivy's. Looks like this is working as intended.

| Dataset | Tantivy ndcg@10 | Apache Lucene ndcg@10 | BEIR BM25 Flat ndcg@10 |
| --- | --- | --- | --- |
| Scifact | 0.6251573122952132 | 0.632431156289918 | 0.679 |
| NFCorpus | 0.20505084876906404 | 0.20712280950112716 | 0.322 |
| TREC-COVID | 0.0362915780899568 | 0.035369826134136535 | 0.595 |
| NQ | 0.2637953053727399 | 0.2803606345656689 | 0.306 |

As for why BEIR gets such high scores: their "BM25" retrieval task is just a wrapper around ElasticSearch. I'm evaluating ElasticSearch now and will update the results soon.

triandco (Author)

Updated the results with the ElasticSearch evaluation and increased the retrieval task complexity from single-field to multifield. The current results look reasonable, since ElasticSearch does a bit more than plain BM25 by default. I'll contact BEIR about the specifics of their test, since their results look a bit too pretty. I'll close this issue. Thank you @fulmicoton!

| Dataset | Tantivy ndcg@10 | Apache Lucene ndcg@10 | BEIR BM25 Flat ndcg@10 | ElasticSearch ndcg@10 |
| --- | --- | --- | --- | --- |
| Scifact | 0.6110550406527024 | 0.6105774540257333 | 0.679 | 0.6563018879997284 |
| NFCorpus | 0.20174488628325865 | 0.2021653197430468 | 0.322 | 0.2116375800036891 |
| TREC-COVID | 0.03640657024103224 | 0.03705072222267741 | 0.595 | 0.05433894833185797 |
| NQ | 0.30181710921729077 | 0.301753090384626 | 0.306 | 0.310128528137924 |
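For reference, the multifield setup on the tantivy side is essentially just giving the query parser several default fields, roughly like this (a sketch; the field names and the title boost are illustrative, not my exact configuration):

```rust
use tantivy::query::QueryParser;
use tantivy::schema::Field;
use tantivy::Index;

// Multifield retrieval: the query parser searches both the title and body fields,
// so a term match in either contributes to the BM25 score.
fn multifield_parser(index: &Index, title: Field, body: Field) -> QueryParser {
    let mut parser = QueryParser::for_index(index, vec![title, body]);
    parser.set_field_boost(title, 2.0); // illustrative boost; tune or drop as needed
    parser
}
```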
