Download dataset:
wget https://public.ukp.informatik.tu-darmstadt.de/thakur/BEIR/datasets/quora.zip
mkdir -p data
mv quora.zip data/
cd data
unzip quora.zip
cd ..
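After unzipping, the dataset should sit in `data/quora` in the usual BEIR layout (`corpus.jsonl`, `queries.jsonl`, `qrels/test.tsv`). As a quick sanity check, here is a minimal sketch for loading it with the standard library. The function names are illustrative and are not taken from the scripts in this repository; the field names (`_id`, `query-id`, `corpus-id`, `score`) follow the BEIR convention.

```python
import csv
import json


def load_jsonl(path):
    """Read a JSON Lines file into a dict keyed by the BEIR `_id` field."""
    records = {}
    with open(path, encoding="utf-8") as f:
        for line in f:
            if line.strip():  # skip blank lines
                row = json.loads(line)
                records[row["_id"]] = row
    return records


def load_qrels(path):
    """Read a BEIR qrels TSV into {query_id: set of relevant doc ids}."""
    qrels = {}
    with open(path, encoding="utf-8", newline="") as f:
        reader = csv.DictReader(f, delimiter="\t")
        for row in reader:
            # Keep only pairs judged relevant (score > 0).
            if int(row["score"]) > 0:
                qrels.setdefault(row["query-id"], set()).add(row["corpus-id"])
    return qrels
```

For example, `load_jsonl("data/quora/corpus.jsonl")` and `load_qrels("data/quora/qrels/test.tsv")` should return non-empty dictionaries if the download and unzip steps succeeded (paths assumed relative to the repository root).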
pip install -r requirements.txt
(Note: for GPU inference, see fastembed.)
The BM25 version uses the tantivy library for indexing and search.
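For orientation, the kind of score a BM25 engine computes can be sketched in a few lines of plain Python. This is the textbook Okapi BM25 formula with the common defaults `k1 = 1.2` and `b = 0.75` and a whitespace tokenizer. It is an illustration only, not tantivy's implementation, and the function name `bm25_scores` is hypothetical.

```python
import math
from collections import Counter


def bm25_scores(query, docs, k1=1.2, b=0.75):
    """Score every document in `docs` against `query` with textbook BM25."""
    tokenized = [doc.lower().split() for doc in docs]
    n_docs = len(tokenized)
    avg_len = sum(len(toks) for toks in tokenized) / n_docs

    # Document frequency: in how many documents each term occurs.
    df = Counter(term for toks in tokenized for term in set(toks))

    scores = []
    for toks in tokenized:
        tf = Counter(toks)
        score = 0.0
        for term in query.lower().split():
            if term not in tf:
                continue
            idf = math.log((n_docs - df[term] + 0.5) / (df[term] + 0.5) + 1)
            # Length normalization: longer-than-average documents are penalized.
            norm = tf[term] + k1 * (1 - b + b * len(toks) / avg_len)
            score += idf * tf[term] * (k1 + 1) / norm
        scores.append(score)
    return scores


# Toy corpus, made up for this example.
docs = [
    "how do I learn python quickly",
    "what is the best way to learn cooking",
    "python tips for beginners",
]
scores = bm25_scores("learn python", docs)
best = max(range(len(docs)), key=scores.__getitem__)
print(best, scores)
```

The first document matches both query terms and therefore ranks highest in this toy corpus.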
python index_bm25.py
python evaluate-bm25.py
Results we got:
Total hits: 12065 out of 15675, which is 0.7696969696969697
Precision: 0.12065
Average precision: 0.12065
Average recall: 0.8952571817831299
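These figures are consistent with one reading: 12065 / 15675 ≈ 0.7697 is the share of all relevant query-document pairs that were retrieved, and a precision of 0.12065 corresponds to 12065 hits among 100,000 retrieved results (for example 10,000 queries at 10 results each, which is an inference from the numbers, not something stated here). A minimal sketch of how such metrics could be computed follows. The function `evaluate` and its input shapes are hypothetical and are not taken from `evaluate-bm25.py`.

```python
def evaluate(results, qrels, limit=10):
    """Compute hit counts plus micro and per-query precision/recall.

    results: {query_id: ranked list of doc ids}
    qrels:   {query_id: non-empty set of relevant doc ids}
    """
    total_hits = 0
    total_relevant = 0
    precisions = []
    recalls = []

    for query_id, relevant in qrels.items():
        retrieved = results.get(query_id, [])[:limit]
        hits = sum(1 for doc_id in retrieved if doc_id in relevant)

        total_hits += hits
        total_relevant += len(relevant)
        precisions.append(hits / limit)
        recalls.append(hits / len(relevant))

    n_queries = len(qrels)
    return {
        "total_hits": total_hits,
        "total_relevant": total_relevant,
        "hit_ratio": total_hits / total_relevant,
        "precision": total_hits / (n_queries * limit),
        "average_precision": sum(precisions) / n_queries,
        "average_recall": sum(recalls) / n_queries,
    }


# Toy data, made up for this example.
qrels = {"q1": {"d1", "d2"}, "q2": {"d3"}}
results = {"q1": ["d1", "d9", "d2"], "q2": ["d7", "d8"]}
metrics = evaluate(results, qrels, limit=2)
print(metrics)
```

With a fixed `limit`, the pooled precision and the mean of per-query precisions are the same quantity, so any difference between them is floating-point rounding from summing many small terms.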
Additionally, we evaluate a pure sparse-vector implementation of BM25 in Qdrant. It uses exactly the same tokenizer and stemmer as BM42, which makes for a fairer comparison.
# Run qdrant
docker run --rm -d --network=host qdrant/qdrant:v1.10.0
python index_bm25_qdrant.py
python evaluate-bm25-qdrant.py
Results we got:
Total hits: 11151 out of 15675, which is 0.7113875598086125
Precision: 0.11151
Average precision: 0.1115100000000054
Average recall: 0.8321873943359426
BM42 uses the fastembed implementation for inference and Qdrant for indexing and search.
IDF values are calculated inside Qdrant.
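Conceptually, BM42 replaces the term-frequency part of BM25 with a per-token weight taken from the model's attention, while the IDF part is applied by the search engine at query time. The sketch below illustrates that split in plain Python. The per-token weights are made-up numbers, the IDF formula is the standard BM25 one, and `idf` and `idf_weighted_score` are hypothetical helpers; none of this is Qdrant's or fastembed's actual code.

```python
import math


def idf(term, doc_vectors):
    """Standard BM25-style IDF over a collection of sparse document vectors."""
    n_docs = len(doc_vectors)
    df = sum(1 for vec in doc_vectors if term in vec)
    return math.log((n_docs - df + 0.5) / (df + 0.5) + 1)


def idf_weighted_score(query_terms, doc_vector, doc_vectors):
    """Sum of IDF(term) * stored document weight over matching query terms."""
    return sum(
        idf(term, doc_vectors) * doc_vector[term]
        for term in query_terms
        if term in doc_vector
    )


# Hypothetical per-token weights, standing in for model attention values.
doc_vectors = [
    {"learn": 0.30, "python": 0.55, "quickly": 0.15},
    {"learn": 0.25, "cooking": 0.60, "best": 0.15},
    {"python": 0.20, "tips": 0.35, "beginners": 0.45},
]

query = ["learn", "python"]
scores = [idf_weighted_score(query, vec, doc_vectors) for vec in doc_vectors]
print(scores)
```

In the setup described here, the IDF statistics come from the collection inside Qdrant, so the client only needs to supply the per-token weights.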
# Run qdrant
docker run --rm -d --network=host qdrant/qdrant:v1.10.0
python index_bm42.py
python evaluate-bm42.py
Results we got:
Total hits: 11488 out of 15675, which is 0.7328867623604466
Precision: 0.11488
Average precision: 0.11488000000000238
Average recall: 0.8515208038970792