Tantivy NDCG Benchmarking Information Retrieval (BEIR) #2455

Closed
triandco opened this issue Jul 19, 2024 · 3 comments
triandco commented Jul 19, 2024

I created a repo to evaluate Tantivy retrieval using measurements such as NDCG, MAP, and recall, following the BEIR methodology. In the project, I use tantivy to index and retrieve documents from multiple datasets. The retrieval results are saved to a TSV file and then loaded into Python for scoring with pytrec_eval (which is what BEIR is built on).

Currently, my results are suspiciously low compared to the BM25-flat baseline published on the BEIR leaderboard.

| Dataset | Tantivy ndcg@10 | BEIR BM25-flat ndcg@10 |
| --- | --- | --- |
| Scifact | 0.6251573122952132 | 0.679 |
| NFCorpus | 0.20505084876906404 | 0.322 |
| TREC-COVID | 0.0362915780899568 | 0.595 |

I was following the tantivy example to index and search; I'm not sure whether this is the best approach.
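Roughly, the flow follows tantivy's basic search example: build a schema, index each corpus document, then run each query through a QueryParser and keep the top 10 BM25 hits. A simplified sketch (the field names, memory budget, and the sample document/query below are illustrative, not my exact code):

```rust
use tantivy::collector::TopDocs;
use tantivy::query::QueryParser;
use tantivy::schema::{Schema, STORED, STRING, TEXT};
use tantivy::{doc, Index};

fn main() -> tantivy::Result<()> {
    // Schema: a raw stored doc id (needed to write the run file) and a tokenized text field.
    let mut schema_builder = Schema::builder();
    let doc_id = schema_builder.add_text_field("doc_id", STRING | STORED);
    let text = schema_builder.add_text_field("text", TEXT);
    let schema = schema_builder.build();

    // Index the corpus (one add_document call per BEIR document).
    let index = Index::create_in_ram(schema);
    let mut writer = index.writer(50_000_000)?;
    writer.add_document(doc!(doc_id => "doc1", text => "the incubation period of covid-19"))?;
    writer.commit()?;

    // For each BEIR query: parse it, take the top 10 BM25 hits, and record (doc, score).
    let reader = index.reader()?;
    let searcher = reader.searcher();
    let query_parser = QueryParser::for_index(&index, vec![text]);
    let query = query_parser.parse_query("covid incubation period")?;
    let top_docs = searcher.search(&query, &TopDocs::with_limit(10))?;
    for (score, doc_address) in top_docs {
        println!("{doc_address:?}\t{score}");
    }
    Ok(())
}
```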

If you could have a look and let me know whether there's anything wonky with my retrieval implementation, that would be very much appreciated.

fulmicoton (Collaborator) commented Jul 21, 2024

I don't have time to debug this thing. One thing you can do is pick one specific example where tantivy is outperformed by BM25, and use "explain".

The usual suspects are

  • your evaluation code
  • tokenization
  • the way queries are sanitized
  • BM25 constants

Tantivy has an explain function that tells you precisely how tantivy came up with a given score.
(The formula should be the same as Lucene's.)

https://docs.rs/tantivy/latest/tantivy/query/trait.Query.html#method.explain
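Something along these lines (an untested sketch; the function name and query string are made up):

```rust
use tantivy::collector::TopDocs;
use tantivy::query::{Query, QueryParser};
use tantivy::schema::Field;
use tantivy::{Index, Searcher};

// Print the score breakdown of every top-10 hit for a single query.
fn explain_top_hits(index: &Index, searcher: &Searcher, text: Field, query_str: &str) -> tantivy::Result<()> {
    let query_parser = QueryParser::for_index(index, vec![text]);
    let query = query_parser.parse_query(query_str)?;
    for (_score, doc_address) in searcher.search(&query, &TopDocs::with_limit(10))? {
        // The explanation breaks the BM25 score into its idf / tf / fieldnorm parts,
        // which can be compared against Lucene's explain output for the same document.
        let explanation = query.explain(searcher, doc_address)?;
        println!("{}", explanation.to_pretty_json());
    }
    Ok(())
}
```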

triandco changed the title from "Tantivy NDCG Retrieval Evaluation (BEIR)" to "Tantivy NDCG Benchmarking Information Retrieval (BEIR)" on Jul 22, 2024
triandco (Author)

Thank you @fulmicoton, I've gone through each of the usual suspects and verified them. I also ran the task against Lucene, which yielded scores very close to tantivy's. Looks like this is working as intended.

| Dataset | Tantivy ndcg@10 | Apache Lucene ndcg@10 | BEIR BM25 Flat ndcg@10 |
| --- | --- | --- | --- |
| Scifact | 0.6251573122952132 | 0.632431156289918 | 0.679 |
| NFCorpus | 0.20505084876906404 | 0.20712280950112716 | 0.322 |
| TREC-COVID | 0.0362915780899568 | 0.035369826134136535 | 0.595 |
| NQ | 0.2637953053727399 | 0.2803606345656689 | 0.306 |

As for why BEIR gets such high scores: their "BM25" retrieval task is just a wrapper around ElasticSearch. I'm evaluating ElasticSearch now and will update the results soon.

triandco (Author)

Updated the results with the ElasticSearch evaluation and increased the retrieval task complexity from single-field to multifield. The current results look reasonable, since ElasticSearch does a bit more than plain BM25 by default. I'll contact BEIR about the specifics of their test, since their results look a bit too pretty. I'll close this issue. Thank you @fulmicoton!

| Dataset | Tantivy ndcg@10 | Apache Lucene ndcg@10 | BEIR BM25 Flat ndcg@10 | ElasticSearch ndcg@10 |
| --- | --- | --- | --- | --- |
| Scifact | 0.6110550406527024 | 0.6105774540257333 | 0.679 | 0.6563018879997284 |
| NFCorpus | 0.20174488628325865 | 0.2021653197430468 | 0.322 | 0.2116375800036891 |
| TREC-COVID | 0.03640657024103224 | 0.03705072222267741 | 0.595 | 0.05433894833185797 |
| NQ | 0.30181710921729077 | 0.301753090384626 | 0.306 | 0.310128528137924 |
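For reference, the multifield setup on the tantivy side is essentially just giving the query parser several default fields, roughly like this (a sketch; the field names and the title boost are illustrative, not my exact configuration):

```rust
use tantivy::query::QueryParser;
use tantivy::schema::Field;
use tantivy::Index;

// Multifield retrieval: the query parser searches both the title and body fields,
// so a term match in either contributes to the BM25 score.
fn multifield_parser(index: &Index, title: Field, body: Field) -> QueryParser {
    let mut parser = QueryParser::for_index(index, vec![title, body]);
    parser.set_field_boost(title, 2.0); // illustrative boost; tune or drop as needed
    parser
}
```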
