Agentic search benchmarks on search datasets.
How well can an agent search with just a few basic retrieval tools?
Each strategy pairs one or more retrieval backends with a set of tools. Below you'll see some combination of:
- e5-base-v2
- minilm
- bm25 w/ std params on title and description fields.
All strategies share essentially the same prompt. Click each strategy to see the prompt and tools used. The tool-calling loop runs once per query and produces a final ranked list of results, which we then evaluate.
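As a rough illustration of the BM25 backend, here is a minimal sketch of Okapi BM25 scored over the title and description fields. It assumes "std params" means k1=1.2 and b=0.75, that per-field scores are simply summed, and that tokenization is a lowercase whitespace split. None of these choices are confirmed by the benchmark itself, and the function names are hypothetical.

```python
import math
from collections import Counter

K1, B = 1.2, 0.75  # assumed "standard" BM25 parameters


def tokenize(text):
    # Assumption: lowercase + whitespace split; the real analyzer may differ.
    return text.lower().split()


def bm25_field_scores(query, docs, field):
    """Score every doc against the query on one field with Okapi BM25."""
    tokenized = [tokenize(d[field]) for d in docs]
    n_docs = len(tokenized)
    avg_len = sum(len(t) for t in tokenized) / n_docs
    # Document frequency: number of docs containing each term in this field.
    doc_freq = Counter(term for toks in tokenized for term in set(toks))
    scores = []
    for toks in tokenized:
        tf = Counter(toks)
        score = 0.0
        for term in tokenize(query):
            if term not in tf:
                continue
            df = doc_freq[term]
            idf = math.log(1 + (n_docs - df + 0.5) / (df + 0.5))
            norm = tf[term] + K1 * (1 - B + B * len(toks) / avg_len)
            score += idf * tf[term] * (K1 + 1) / norm
        scores.append(score)
    return scores


def bm25_search(query, docs, fields=("title", "description"), top_k=10):
    """Sum per-field BM25 scores (an assumption) and return the top_k (id, score) pairs."""
    if not docs:
        return []
    totals = [0.0] * len(docs)
    for field in fields:
        for i, s in enumerate(bm25_field_scores(query, docs, field)):
            totals[i] += s
    ranked = sorted(range(len(docs)), key=lambda i: totals[i], reverse=True)
    return [(docs[i]["id"], totals[i]) for i in ranked[:top_k]]
```

A function like `bm25_search` is the kind of thing that could be exposed to the agent as a single retrieval tool; the embedding backends would have the same shape with a different scorer.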
Baselines first, agentic sorted by NDCG ascending. N=1000 queries.
ESCI is Amazon's Shopping Queries dataset for product search relevance, with graded labels.
| strategy | model | mean | median |
|---|---|---|---|
| bm25 | n/a | 0.2895 | 0.1707 |
| embedding_minilm | n/a | 0.2304 | 0.0854 |
| embedding_e5 | n/a | 0.3142 | 0.2250 |
| agentic_minilm_ecommerce_gpt5_mini | gpt-5-mini | 0.2952 | 0.1749 |
| agentic_e5_ecommerce_gpt5_mini | gpt-5-mini | 0.3587 | 0.3181 |
| agentic_bm25_ecommerce_gpt5_mini | gpt-5-mini | 0.3850 | 0.3414 |
| agentic_bm25_minilm_ecommerce_gpt5_mini | gpt-5-mini | 0.3958 | 0.3414 |
| agentic_bm25_e5_ecommerce_gpt5_mini | gpt-5-mini | 0.4101 | 0.3743 |
| agentic_bm25_e5_ecommerce_gpt5 | gpt-5 | 0.4535 | 0.4417 |
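The mean and median columns aggregate a per-query NDCG score. A minimal sketch of that computation follows, assuming a cutoff of k=10, linear gains (the raw grade), and a log2 discount. The benchmark's actual cutoff and gain function are not stated here, so treat these as placeholders.

```python
import math
from statistics import mean, median


def dcg(gains):
    """Discounted cumulative gain: the gain at rank r (1-based) is divided by log2(r + 1)."""
    return sum(g / math.log2(rank + 2) for rank, g in enumerate(gains))


def ndcg_at_k(ranked_ids, grades, k=10):
    """NDCG@k for one query.

    ranked_ids: doc ids in the order the strategy returned them.
    grades: dict mapping doc id -> graded relevance label (unjudged docs count as 0).
    """
    gains = [grades.get(doc_id, 0.0) for doc_id in ranked_ids[:k]]
    ideal = sorted(grades.values(), reverse=True)[:k]
    ideal_dcg = dcg(ideal)
    return dcg(gains) / ideal_dcg if ideal_dcg > 0 else 0.0


def summarize(per_query_scores):
    """Return (mean, median) over all queries, as reported in the tables."""
    return mean(per_query_scores), median(per_query_scores)
```

The gap between mean and median in several rows suggests a skewed per-query distribution, which is why both are worth reporting.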
Baselines first, agentic sorted by NDCG ascending.
WANDS is Wayfair's product search relevance dataset, with graded judgments.
| strategy | model | mean | median |
|---|---|---|---|
| bm25 | n/a | 0.5408 | 0.4746 |
| embedding_minilm | n/a | 0.5060 | 0.4083 |
| embedding_e5 | n/a | 0.5571 | 0.5475 |
| agentic_minilm_ecommerce_gpt5_mini | gpt-5-mini | 0.5330 | 0.4874 |
| agentic_e5_ecommerce_gpt5_mini | gpt-5-mini | 0.5744 | 0.5609 |
| agentic_bm25_ecommerce_gpt5_mini | gpt-5-mini | 0.5803 | 0.5609 |
| agentic_bm25_minilm_ecommerce_gpt5_mini | gpt-5-mini | 0.5867 | 0.5609 |
| agentic_bm25_e5_ecommerce_gpt5_mini | gpt-5-mini | 0.5980 | 0.5609 |
| agentic_bm25_e5_ecommerce_gpt5 | gpt-5 | 0.6171 | 0.6256 |
MS MARCO results, scored with MRR. Baselines first, agentic sorted by MRR ascending.
| strategy | model | mean | median |
|---|---|---|---|
| bm25_msmarco | n/a | 0.4913 | 0.3333 |
| embedding_e5_msmarco | n/a | 0.6983 | 1.0000 |
| agentic_msmarco_bm25_gpt5_mini | gpt-5-mini | 0.4647 | 0.3333 |
| agentic_msmarco_bm25_e5_gpt5_mini | gpt-5-mini | 0.5689 | 0.5000 |
| agentic_msmarco_e5_gpt5_mini | gpt-5-mini | 0.6123 | 1.0000 |
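For reference, MRR is the mean over queries of the reciprocal rank of the first relevant result. A minimal sketch, assuming binary relevance and no rank cutoff (neither is stated for these runs):

```python
from statistics import mean


def reciprocal_rank(ranked_ids, relevant_ids):
    """1 / rank of the first relevant doc, or 0.0 if none was returned."""
    for rank, doc_id in enumerate(ranked_ids, start=1):
        if doc_id in relevant_ids:
            return 1.0 / rank
    return 0.0


def mean_reciprocal_rank(runs):
    """runs: iterable of (ranked_ids, relevant_ids) pairs, one per query."""
    return mean(reciprocal_rank(ranked, relevant) for ranked, relevant in runs)
```

Under this definition a median of 1.0000 means at least half of the queries had a relevant result in first position, and a median of 0.3333 corresponds to a first relevant hit at rank 3.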
Below we force the agent to make at least 4 calls to a retrieval backend, with two different ways of rejecting repeat queries: direct equivalence (after lowercasing, etc.) and semantic similarity. The first row repeats the unconstrained ESCI result from above for comparison.
| strategy | model | mean | median |
|---|---|---|---|
| agentic_bm25_e5_ecommerce_gpt5_mini | gpt-5-mini | 0.4101 | 0.3743 |
| agentic_bm25_e5_ecommerce_4calls_repeat_gpt5_mini | gpt-5-mini | 0.4290 | 0.3948 |
| agentic_bm25_e5_ecommerce_4calls_sim0p9_gpt5_mini | gpt-5-mini | 0.4236 | 0.4258 |
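The two enforcement modes could be sketched roughly as below. This is an illustration under several assumptions: normalization is lowercasing plus whitespace collapsing, `sim0p9` is read as a cosine similarity threshold of 0.9, rejected queries do not count toward the 4-call minimum, and `toy_embed` (character trigram counts) is only a stand-in for a real embedding model. `SearchCallGuard` and its methods are hypothetical names.

```python
import math
import re
from collections import Counter


def normalize(query):
    """Lowercase and collapse whitespace (assumed normalization)."""
    return re.sub(r"\s+", " ", query.lower()).strip()


def toy_embed(text):
    """Stand-in embedding: character trigram counts. Swap in a real model."""
    padded = f"  {normalize(text)} "
    return Counter(padded[i:i + 3] for i in range(len(padded) - 2))


def cosine(a, b):
    dot = sum(a[key] * b[key] for key in a.keys() & b.keys())
    norm = math.sqrt(sum(v * v for v in a.values())) * math.sqrt(sum(v * v for v in b.values()))
    return dot / norm if norm else 0.0


class SearchCallGuard:
    """Tracks accepted search queries and rejects repeats."""

    def __init__(self, min_calls=4, sim_threshold=None, embed=toy_embed):
        self.min_calls = min_calls
        self.sim_threshold = sim_threshold  # None means exact-match mode only
        self.embed = embed
        self.accepted = []

    def check(self, query):
        """Return an error message to feed back to the agent, or None if accepted."""
        norm = normalize(query)
        for prev in self.accepted:
            if norm == prev:
                return f"Query repeats an earlier one: {prev!r}. Try something different."
            if self.sim_threshold is not None:
                if cosine(self.embed(norm), self.embed(prev)) >= self.sim_threshold:
                    return f"Query is too similar to: {prev!r}. Try something different."
        self.accepted.append(norm)
        return None

    def may_finish(self):
        """The agent may return final results only after min_calls accepted searches."""
        return len(self.accepted) >= self.min_calls
```

In a tool-calling loop, `check` would run before each retrieval call, and `may_finish` would gate the tool that submits the final ranking.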
Example command for running one strategy:
uv run run --strategy configs/ecom_base/agentic_ecom_minilm_gpt5_mini.yml --dataset wands --num-queries 1000



