
BM25 Benchmarks

Benchmarking

To run the benchmark on a given BM25 implementation, run one of the following:

# For bm25_pt
python -m benchmark.on_bm25_pt -d "<dataset>"

# For rank-bm25
python -m benchmark.on_rank_bm25 -d "<dataset>"

# For Pyserini
python -m benchmark.on_pyserini -d "<dataset>"

# For Elasticsearch (start the server first; see below)
python -m benchmark.on_elastic -d "<dataset>"

# For PISA
python -m benchmark.on_pisa -d "<dataset>"

where <dataset> is the name of the dataset to use.
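Each script wraps the corresponding library's own indexing and search API. As a point of reference, retrieval with Pyserini looks roughly like the sketch below; this is an illustrative snippet using one of Pyserini's prebuilt BEIR indexes, not the exact code in benchmark.on_pyserini:

from pyserini.search.lucene import LuceneSearcher

# Open a prebuilt Lucene index for a BEIR dataset (downloaded on first use);
# the name follows Pyserini's beir-v1.0.0-<dataset>.flat naming scheme
searcher = LuceneSearcher.from_prebuilt_index("beir-v1.0.0-nfcorpus.flat")

hits = searcher.search("effects of vitamin D deficiency", k=10)
for hit in hits:
    print(hit.docid, hit.score)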

Available datasets

The available datasets are the following public BEIR datasets: trec-covid, nfcorpus, fiqa, arguana, webis-touche2020, quora, scidocs, scifact, cqadupstack, nq, msmarco, hotpotqa, dbpedia-entity, fever, and climate-fever.

Sampling during benchmarking

For rank-bm25, you can sample a subset of the queries to keep the long runtime manageable:

python -m benchmark.on_rank_bm25 -d "<dataset>" --samples <num_samples>

Rank-bm25 variants

For rank-bm25, you can also select the scoring variant with --method:

  • rank (default)
  • bm25l
  • bm25+

Results will be saved in the results/ directory.
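These options correspond to the three scorer classes exposed by the rank_bm25 package. A minimal sketch of that API, used directly outside the benchmark harness (the flag-to-class mapping here is an assumption based on the package's documentation):

from rank_bm25 import BM25Okapi, BM25L, BM25Plus

corpus = ["the cat sat on the mat", "dogs chase cats", "bm25 ranks documents"]
tokenized_corpus = [doc.split() for doc in corpus]

# Assumed mapping: --method rank -> BM25Okapi, bm25l -> BM25L, bm25+ -> BM25Plus
bm25 = BM25Okapi(tokenized_corpus)

query = "cat on the mat".split()
scores = bm25.get_scores(query)                # one score per document
top_docs = bm25.get_top_n(query, corpus, n=2)  # top-ranked raw documents
print(scores, top_docs)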

Elasticsearch server

If you want to use Elasticsearch, you need to start the server first.

First, download Elasticsearch from the official downloads page. You will get an archive, e.g. elasticsearch-8.14.0-linux-x86_64.tar.gz. Extract the archive and ensure the resulting directory is in the same location as the bm25-benchmarks directory.

wget https://artifacts.elastic.co/downloads/elasticsearch/elasticsearch-8.14.0-linux-x86_64.tar.gz
tar -xzf elasticsearch-8.14.0-linux-x86_64.tar.gz
# remove the tar file
rm elasticsearch-8.14.0-linux-x86_64.tar.gz

Then, start the server with the following command:

./elasticsearch-8.14.0/bin/elasticsearch -E xpack.security.enabled=false -E thread_pool.search.size=1 -E thread_pool.write.size=1
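Once the server is up (by default on http://localhost:9200), the benchmark talks to it over HTTP. A minimal sketch of a match query with the official Python client; the index and field names below are illustrative assumptions, since the benchmark script manages its own index naming and mappings:

from elasticsearch import Elasticsearch

es = Elasticsearch("http://localhost:9200")

# Run a BM25-scored match query against a hypothetical index
resp = es.search(
    index="beir-nfcorpus",
    query={"match": {"text": "effects of vitamin D deficiency"}},
    size=10,
)
for hit in resp["hits"]["hits"]:
    print(hit["_id"], hit["_score"])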

Results

The results are benchmarked using Kaggle notebooks to ensure reproducibility. Each run uses a single core of an Intel Xeon CPU @ 2.20GHz with 30GB of RAM.

The shorthands used are:

  • BM25PT for bm25_pt
  • PSRN for pyserini
  • R-BM25 for rank-bm25
  • ES for elasticsearch
  • PISA for the PISA engine (via the pyterrier_pisa Python bindings)
  • OOM for out-of-memory error
  • DNT for did not terminate (i.e. went over 12 hours)

Queries per second

| dataset | PISA | BM25S | ES | PSRN | PT | Rank |
|---|---|---|---|---|---|---|
| arguana | 128.01 | 573.91 | 13.67 | 11.95 | 110.51 | 2 |
| climate-fever | 25.66 | 13.09 | 4.02 | 8.06 | OOM | 0.03 |
| cqadupstack | 281.65 | 170.91 | 13.38 | DNT | OOM | 0.77 |
| dbpedia-entity | 114.05 | 13.44 | 10.68 | 12.69 | OOM | 0.11 |
| fever | 60.76 | 20.19 | 7.45 | 10.52 | OOM | 0.06 |
| fiqa | 546.61 | 717.78 | 16.96 | 12.51 | 20.52 | 4.46 |
| hotpotqa | 39.70 | 20.88 | 7.11 | 10.41 | OOM | 0.04 |
| msmarco | 125.18 | 12.2 | 11.88 | 11.01 | OOM | 0.07 |
| nfcorpus | 2639.38 | 1196.16 | 45.84 | 32.94 | 256.67 | 224.66 |
| nq | 117.34 | 41.85 | 12.16 | 11.04 | OOM | 0.1 |
| quora | 528.45 | 272.04 | 21.8 | 15.58 | 6.49 | 1.18 |
| scidocs | 678.06 | 767.05 | 17.93 | 14.1 | 41.34 | 9.01 |
| scifact | 1100.60 | 1317.12 | 20.81 | 15.02 | 184.3 | 47.6 |
| trec-covid | 140.92 | 85.64 | 7.34 | 8.53 | 3.73 | 1.48 |
| webis-touche2020 | 201.19 | 60.59 | 13.53 | 12.36 | OOM | 1.1 |

Notes:

  • For Rank-BM25, larger datasets are run with 1000 sampled queries rather than the full query set, to ensure the benchmark finishes within 12h (the limit for Kaggle notebooks).
  • For ES and BM25S, the number of threads can be configured. However, you might not see an improvement; in the case of BM25S, throughput may even decrease due to how multi-threading is implemented. The multi-threaded results are shown below.
Multi-threaded (4T) performance (queries/s):
| dataset | PISA | BM25S | ES |
|---|---|---|---|
| arguana | 311.00 | 211 | 33.37 |
| climate-fever | 69.89 | 22.06 | 8.13 |
| cqadupstack | 694.82 | 248.87 | 27.76 |
| dbpedia-entity | 294.66 | 26.18 | 15.49 |
| fever | 166.77 | 47.03 | 14.07 |
| fiqa | 1240.22 | 449.82 | 36.33 |
| hotpotqa | 109.44 | 45.02 | 10.35 |
| msmarco | 340.29 | 21.64 | 18.19 |
| nfcorpus | 3188.22 | 784.24 | 81.07 |
| nq | 312.09 | 77.49 | 19.18 |
| quora | 1534.93 | 308.58 | 43.02 |
| scidocs | 1461.20 | 614.23 | 46.36 |
| scifact | 2620.73 | 645.88 | 50.93 |
| trec-covid | 268.96 | 100.88 | 13.5 |
| webis-touche2020 | 297.79 | 202.39 | 26.55 |
Throughput normalized with respect to Rank-BM25 (Rank = 1):
| dataset | PISA | BM25S | ES | PSRN | PT | Rank |
|---|---|---|---|---|---|---|
| arguana | 64.01 | 286.96 | 6.84 | 5.98 | 55.26 | 1 |
| climate-fever | 855.33 | 436.33 | 134 | 268.67 | nan | 1 |
| cqadupstack | 365.78 | 221.96 | 17.38 | nan | nan | 1 |
| dbpedia-entity | 1036.82 | 122.18 | 97.09 | 115.36 | nan | 1 |
| fever | 1012.67 | 336.5 | 124.17 | 175.33 | nan | 1 |
| fiqa | 122.56 | 160.94 | 3.8 | 2.8 | 4.6 | 1 |
| hotpotqa | 992.50 | 522 | 177.75 | 260.25 | nan | 1 |
| msmarco | 1788.29 | 174.29 | 169.71 | 157.29 | nan | 1 |
| nfcorpus | 11.75 | 5.32 | 0.2 | 0.15 | 1.14 | 1 |
| nq | 1173.40 | 418.5 | 121.6 | 110.4 | nan | 1 |
| quora | 447.84 | 230.54 | 18.47 | 13.2 | 5.5 | 1 |
| scidocs | 75.26 | 85.13 | 1.99 | 1.56 | 4.59 | 1 |
| scifact | 23.12 | 27.67 | 0.44 | 0.32 | 3.87 | 1 |
| trec-covid | 95.22 | 57.86 | 4.96 | 5.76 | 2.52 | 1 |
| webis-touche2020 | 182.90 | 55.08 | 12.3 | 11.24 | nan | 1 |
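For context, BM25S exposes the thread count at retrieval time. A minimal sketch based on the bm25s package's documented quickstart; treat the exact call, in particular the n_threads argument, as an assumption rather than the code used in these runs:

import bm25s

corpus = ["the cat sat on the mat", "dogs chase cats", "bm25 ranks documents"]

# Build the index from tokenized documents
retriever = bm25s.BM25()
retriever.index(bm25s.tokenize(corpus))

# n_threads controls query-time parallelism (assumed to correspond to the 4T setup above)
results, scores = retriever.retrieve(bm25s.tokenize("cat on the mat"), k=2, n_threads=4)
print(results, scores)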

Stats

| dataset | # Docs | # Queries | # Tokens |
|---|---|---|---|
| msmarco | 8,841,823 | 6,980 | 340,859,891 |
| hotpotqa | 5,233,329 | 7,405 | 169,530,287 |
| trec-covid | 171,332 | 50 | 20,231,412 |
| webis-touche2020 | 382,545 | 49 | 74,180,340 |
| arguana | 8,674 | 1,406 | 947,470 |
| fiqa | 57,638 | 648 | 5,189,035 |
| nfcorpus | 3,633 | 323 | 614,081 |
| climate-fever | 5,416,593 | 1,535 | 318,190,120 |
| nq | 2,681,468 | 3,452 | 148,249,808 |
| scidocs | 25,657 | 1,000 | 3,211,248 |
| quora | 522,931 | 10,000 | 4,202,123 |
| dbpedia-entity | 4,635,922 | 400 | 162,336,256 |
| cqadupstack | 457,199 | 13,145 | 44,857,487 |
| fever | 5,416,568 | 6,666 | 318,184,321 |
| scifact | 5,183 | 300 | 812,074 |
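The document and query counts can be reproduced by loading each dataset with the beir package. A sketch using BEIR's documented loader (the URL is BEIR's public dataset mirror; scifact is shown as an example):

from beir import util
from beir.datasets.data_loader import GenericDataLoader

# Download and unzip one BEIR dataset
url = "https://public.ukp.informatik.tu-darmstadt.de/thakur/BEIR/datasets/scifact.zip"
data_path = util.download_and_unzip(url, "datasets")

corpus, queries, qrels = GenericDataLoader(data_folder=data_path).load(split="test")
print(len(corpus), "docs,", len(queries), "queries")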

Indexing speed (docs/s)

The following results follow the same setup as the queries/s benchmarks described above (single-core).

| dataset | PISA | BM25S | ES | PSRN | PT | Rank |
|---|---|---|---|---|---|---|
| arguana | 3608.75 | 4314.79 | 3591.63 | 1225.18 | 638.1 | 5021.3 |
| climate-fever | 5474.84 | 4364.43 | 3825.89 | 6880.42 | nan | 7085.51 |
| cqadupstack | 4380.73 | 4800.89 | 3725.43 | nan | nan | 5370.32 |
| dbpedia-entity | 9532.44 | 7576.28 | 6333.82 | 8501.7 | nan | 9110.36 |
| fever | 5633.20 | 4921.88 | 3879.63 | 7007.5 | nan | 5482.64 |
| fiqa | 4655.99 | 5959.25 | 4035.11 | 3735.38 | 421.51 | 6455.53 |
| hotpotqa | 10199.26 | 7420.39 | 5455.6 | 10342.5 | nan | 9407.9 |
| msmarco | 9873.08 | 7480.71 | 5391.29 | 9686.07 | nan | 12455.9 |
| nfcorpus | 2484.47 | 3169.4 | 1688.15 | 692.05 | 442.2 | 3579.47 |
| nq | 6923.50 | 6083.86 | 5742.13 | 6652.33 | nan | 6048.85 |
| quora | 37954.43 | 28002.4 | 8189.75 | 22818.5 | 6251.26 | 47609.2 |
| scidocs | 3076.90 | 4107.46 | 3008.45 | 2137.64 | 312.72 | 4232.15 |
| scifact | 2560.91 | 3253.63 | 2649.57 | 880.53 | 442.61 | 3792.84 |
| trec-covid | 4736.72 | 4600.14 | 2966.98 | 3768.1 | 406.37 | 4672.62 |
| webis-touche2020 | 2301.53 | 2971.96 | 2484.87 | 2718.41 | nan | 3115.96 |

NDCG@10

We use the following abbreviations for the BEIR datasets.

  • AG for arguana
  • CD for cqadupstack
  • CF for climate-fever
  • DB for dbpedia-entity
  • FQ for fiqa
  • FV for fever
  • HP for hotpotqa
  • MS for msmarco
  • NF for nfcorpus
  • NQ for nq
  • QR for quora
  • SD for scidocs
  • SF for scifact
  • TC for trec-covid
  • WT for webis-touche2020
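For reference, k1 and b in the tables below are the two free parameters of the BM25 scoring function. In the classic (Robertson) formulation, the score of document d for query q is

$$\text{score}(q, d) = \sum_{t \in q} \mathrm{IDF}(t) \cdot \frac{f(t, d)\,(k_1 + 1)}{f(t, d) + k_1\left(1 - b + b\,\frac{|d|}{\text{avgdl}}\right)}$$

where f(t, d) is the frequency of term t in d, |d| is the length of d, and avgdl is the average document length in the corpus. The variants below (Lucene, ATIRE, BM25+, BM25L, Robertson) differ mainly in the IDF term and in how length normalization is applied.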
| k1 | b | method | Avg. | AG | CD | CF | DB | FQ | FV | HP | MS | NF | NQ | QR | SD | SF | TC | WT |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0.9 | 0.4 | Lucene | 41.1 | 40.8 | 28.2 | 16.2 | 31.9 | 23.8 | 63.8 | 62.9 | 22.8 | 31.8 | 30.5 | 78.7 | 15.0 | 67.6 | 58.9 | 44.2 |
| 1.2 | 0.75 | ATIRE | 39.9 | 48.7 | 30.1 | 13.7 | 30.3 | 25.3 | 50.3 | 58.5 | 22.6 | 31.8 | 29.1 | 80.5 | 15.6 | 68.1 | 61.0 | 33.2 |
| 1.2 | 0.75 | BM25+ | 39.9 | 48.7 | 30.1 | 13.7 | 30.3 | 25.3 | 50.3 | 58.5 | 22.6 | 31.8 | 29.1 | 80.5 | 15.6 | 68.1 | 61.0 | 33.2 |
| 1.2 | 0.75 | BM25L | 39.5 | 49.6 | 29.8 | 13.5 | 29.4 | 25.0 | 46.6 | 55.9 | 21.4 | 32.2 | 28.1 | 80.3 | 15.8 | 68.7 | 62.9 | 33.0 |
| 1.2 | 0.75 | Lucene | 39.9 | 48.7 | 30.1 | 13.7 | 30.3 | 25.3 | 50.3 | 58.5 | 22.6 | 31.8 | 29.1 | 80.5 | 15.6 | 68.0 | 61.0 | 33.2 |
| 1.2 | 0.75 | Robertson | 39.9 | 49.2 | 29.9 | 13.7 | 30.3 | 25.4 | 50.3 | 58.5 | 22.6 | 31.9 | 29.2 | 80.4 | 15.5 | 68.3 | 59.0 | 33.8 |
| 1.5 | 0.75 | ES | 42.0 | 47.7 | 29.8 | 17.8 | 31.1 | 25.3 | 62.0 | 58.6 | 22.1 | 34.4 | 31.6 | 80.6 | 16.3 | 69.0 | 68.0 | 35.4 |
| 1.5 | 0.75 | Lucene | 39.7 | 49.3 | 29.9 | 13.6 | 29.9 | 25.1 | 48.1 | 56.9 | 21.9 | 32.1 | 28.5 | 80.4 | 15.8 | 68.7 | 62.3 | 33.1 |
| 1.5 | 0.75 | PSRN | 40.0 | 48.4 | 29.8 | 14.2 | 30.0 | 25.3 | 50.0 | 57.6 | 22.1 | 32.6 | 28.6 | 80.6 | 15.6 | 68.8 | 63.4 | 33.5 |
| 1.5 | 0.75 | PT | 45.0 | 44.9 | -- | -- | -- | 22.5 | -- | -- | -- | 31.9 | -- | 75.1 | 14.7 | 67.8 | 58.0 | -- |
| 1.5 | 0.75 | Rank | 39.6 | 49.5 | 29.6 | 13.6 | 29.9 | 25.3 | 49.3 | 58.1 | 21.1 | 32.1 | 28.5 | 80.3 | 15.8 | 68.5 | 60.1 | 32.9 |
| 1.2 | 0.75 | PISA | 38.8 | 41.2 | 27.8 | 13.8 | 30.6 | 24.6 | 49.1 | 58.1 | 22.6 | 34.4 | 28.2 | 72 | 15.7 | 68.8 | 63.7 | 31.1 |

Recall@1000

| k1 | b | method | Avg. | AG | CD | CF | DB | FQ | FV | HP | MS | NF | NQ | QR | SD | SF | TC | WT |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0.9 | 0.4 | Lucene | 77.3 | 98.8 | 71.1 | 63.3 | 67.5 | 74.3 | 95.7 | 88.0 | 85.3 | 47.7 | 89.6 | 99.5 | 56.5 | 97.0 | 39.2 | 86.0 |
| 1.2 | 0.75 | ATIRE | 77.4 | 99.3 | 73.0 | 59.0 | 67.0 | 76.5 | 94.2 | 86.8 | 85.7 | 47.8 | 89.8 | 99.5 | 57.3 | 97.0 | 40.3 | 87.2 |
| 1.2 | 0.75 | BM25+ | 77.4 | 99.3 | 73.0 | 59.0 | 67.0 | 76.5 | 94.2 | 86.8 | 85.7 | 47.8 | 89.8 | 99.5 | 57.3 | 97.0 | 40.3 | 87.2 |
| 1.2 | 0.75 | BM25L | 77.2 | 99.4 | 73.4 | 57.3 | 66.1 | 77.3 | 93.7 | 85.7 | 85.0 | 47.7 | 89.3 | 99.5 | 57.7 | 97.0 | 40.8 | 87.5 |
| 1.2 | 0.75 | Lucene | 77.4 | 99.3 | 73.0 | 59.0 | 67.0 | 76.5 | 94.2 | 86.8 | 85.6 | 47.8 | 89.8 | 99.5 | 57.3 | 97.0 | 40.3 | 87.2 |
| 1.2 | 0.75 | Robertson | 77.4 | 99.3 | 73.2 | 59.1 | 66.7 | 76.8 | 94.2 | 86.8 | 85.9 | 47.5 | 89.8 | 99.5 | 57.3 | 96.7 | 40.2 | 87.4 |
| 1.5 | 0.75 | ES | 76.9 | 99.2 | 74.2 | 58.8 | 63.6 | 76.7 | 95.9 | 85.2 | 85.1 | 39.0 | 90.8 | 99.6 | 57.9 | 98.0 | 41.3 | 88.0 |
| 1.5 | 0.75 | Lucene | 77.2 | 99.3 | 73.3 | 57.8 | 66.3 | 77.2 | 93.8 | 86.1 | 85.2 | 47.7 | 89.5 | 99.6 | 57.5 | 97.0 | 40.6 | 87.4 |
| 1.5 | 0.75 | PSRN | 76.7 | 99.2 | 74.2 | 58.7 | 66.2 | 76.7 | 94.2 | 86.4 | 85.1 | 37.1 | 89.4 | 99.6 | 57.4 | 97.7 | 41.1 | 87.2 |
| 1.5 | 0.75 | PT | 73.0 | 98.3 | -- | -- | -- | 72.5 | -- | -- | -- | 51.0 | -- | 98.9 | 56.0 | 97.8 | 36.3 | -- |
| 1.5 | 0.75 | Rank | 77.1 | 99.4 | 73.4 | 57.5 | 66.4 | 77.4 | 93.6 | 87.7 | 82.6 | 47.6 | 89.5 | 99.5 | 57.4 | 96.7 | 40.5 | 87.5 |
| 1.2 | 0.75 | PISA | 77.1 | 98.8 | 72.2 | 59.8 | 67.6 | 76.4 | 93.7 | 86.8 | 86.9 | 38.7 | 89.1 | 98.9 | 56.9 | 97 | 46 | 87.7 |
