softwaredoug/search-experiments

Search experiments

Agentic search benchmarks on search datasets.

How well can an agent search with just a few basic retrieval tools?

E-commerce datasets

The search tools

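Strategies use different models and different sets of tools. Below you'll see some combination of BM25, MiniLM embeddings, and E5 embeddings, used either as standalone baselines or as tools for a gpt-5-mini or gpt-5 agent.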

All agentic strategies share basically the same prompt. Click each strategy to see the prompt and tools used. A tool-calling loop runs through once and produces the agent's best ranked results, which we then evaluate.
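A minimal sketch of what one pass of such a tool-calling loop might look like. This is not the repository's implementation: `run_agent_loop`, the `decide` callback (standing in for the LLM), and the `max_steps` budget are all hypothetical names chosen for illustration.

```python
from typing import Callable

# Hypothetical tool type: takes a query string, returns ranked doc ids.
SearchTool = Callable[[str], list[str]]


def run_agent_loop(query: str, tools: dict[str, SearchTool],
                   decide: Callable, max_steps: int = 8) -> list[str]:
    """Run one tool-calling loop and return the agent's final ranking.

    `decide` stands in for the LLM (an assumption for this sketch). Given the
    user query and the history of (tool_name, tool_query, results) tuples, it
    returns either ("call", tool_name, tool_query) or ("final", ranked_ids).
    """
    history = []
    for _ in range(max_steps):
        action = decide(query, history)
        if action[0] == "final":
            return action[1]
        _, tool_name, tool_query = action
        results = tools[tool_name](tool_query)
        history.append((tool_name, tool_query, results))

    # Step budget exhausted: fall back to everything seen, in first-seen order.
    seen: list[str] = []
    for _, _, results in history:
        for doc_id in results:
            if doc_id not in seen:
                seen.append(doc_id)
    return seen


def toy_decide(query, history):
    """Toy stand-in for the model: search once, then return the top two hits."""
    if not history:
        return ("call", "bm25", query)
    return ("final", history[-1][2][:2])
```

The ranked list returned by the loop is what gets scored against the relevance labels.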

Amazon ESCI

Baselines first, agentic sorted by NDCG ascending. N=1000 queries.

ESCI is Amazon's Shopping Queries dataset for product search relevance with graded labels. Source

| strategy | model | mean NDCG | median NDCG |
| --- | --- | --- | --- |
| bm25 | n/a | 0.2895 | 0.1707 |
| embedding_minilm | n/a | 0.2304 | 0.0854 |
| embedding_e5 | n/a | 0.3142 | 0.2250 |
| agentic_minilm_ecommerce_gpt5_mini | gpt-5-mini | 0.2952 | 0.1749 |
| agentic_e5_ecommerce_gpt5_mini | gpt-5-mini | 0.3587 | 0.3181 |
| agentic_bm25_ecommerce_gpt5_mini | gpt-5-mini | 0.3850 | 0.3414 |
| agentic_bm25_minilm_ecommerce_gpt5_mini | gpt-5-mini | 0.3958 | 0.3414 |
| agentic_bm25_e5_ecommerce_gpt5_mini | gpt-5-mini | 0.4101 | 0.3743 |
| agentic_bm25_e5_ecommerce_gpt5 | gpt-5 | 0.4535 | 0.4417 |
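For readers new to the metric, here is a rough sketch of how a per-query NDCG and the mean and median in the table above could be computed. The cutoff `k=10` and the linear gain (using the graded label directly) are assumptions made for this sketch; the exact NDCG variant used for these numbers is not stated here.

```python
import math
from statistics import mean, median


def dcg(gains):
    """Discounted cumulative gain: each gain is discounted by log2(rank + 1)."""
    return sum(g / math.log2(rank + 2) for rank, g in enumerate(gains))


def ndcg(ranked_ids, labels, k=10):
    """NDCG@k for one query.

    `labels` maps doc id -> graded relevance; unlabeled docs count as 0.
    The cutoff k and the linear gain are assumptions of this sketch.
    """
    gains = [labels.get(doc_id, 0.0) for doc_id in ranked_ids[:k]]
    ideal = sorted(labels.values(), reverse=True)[:k]
    ideal_dcg = dcg(ideal)
    return dcg(gains) / ideal_dcg if ideal_dcg > 0 else 0.0


def summarize(per_query_scores):
    """Aggregate per-query scores the way the table columns do."""
    return mean(per_query_scores), median(per_query_scores)
```

A median well below the mean, as for `embedding_minilm`, suggests a skewed distribution: many queries score near zero while a smaller number score high.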

ESCI NDCG plot

ESCI tool calls pareto

Wayfair WANDS

Baselines first, agentic sorted by NDCG ascending.

WANDS is Wayfair's product search relevance dataset with graded judgments. Source

| strategy | model | mean NDCG | median NDCG |
| --- | --- | --- | --- |
| bm25 | n/a | 0.5408 | 0.4746 |
| embedding_minilm | n/a | 0.5060 | 0.4083 |
| embedding_e5 | n/a | 0.5571 | 0.5475 |
| agentic_minilm_ecommerce_gpt5_mini | gpt-5-mini | 0.5330 | 0.4874 |
| agentic_e5_ecommerce_gpt5_mini | gpt-5-mini | 0.5744 | 0.5609 |
| agentic_bm25_ecommerce_gpt5_mini | gpt-5-mini | 0.5803 | 0.5609 |
| agentic_bm25_minilm_ecommerce_gpt5_mini | gpt-5-mini | 0.5867 | 0.5609 |
| agentic_bm25_e5_ecommerce_gpt5_mini | gpt-5-mini | 0.5980 | 0.5609 |
| agentic_bm25_e5_ecommerce_gpt5 | gpt-5 | 0.6171 | 0.6256 |

WANDS NDCG plot

WANDS tool calls pareto

MiniMSMARCO

Baselines first, agentic sorted by MRR ascending.

| strategy | model | mean MRR | median MRR |
| --- | --- | --- | --- |
| bm25_msmarco | n/a | 0.4913 | 0.3333 |
| embedding_e5_msmarco | n/a | 0.6983 | 1.0000 |
| agentic_msmarco_bm25_gpt5_mini | gpt-5-mini | 0.4647 | 0.3333 |
| agentic_msmarco_bm25_e5_gpt5_mini | gpt-5-mini | 0.5689 | 0.5000 |
| agentic_msmarco_e5_gpt5_mini | gpt-5-mini | 0.6123 | 1.0000 |
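The medians in this table take values like 1.0000, 0.5000, and 0.3333 because a per-query reciprocal rank is always 1 divided by the position of the first relevant result. A short sketch of the metric follows; the function names are illustrative and not taken from the repository.

```python
from statistics import mean


def reciprocal_rank(ranked_ids, relevant_ids):
    """Return 1/rank of the first relevant result, or 0.0 if none is found."""
    for rank, doc_id in enumerate(ranked_ids, start=1):
        if doc_id in relevant_ids:
            return 1.0 / rank
    return 0.0


def mean_reciprocal_rank(runs):
    """`runs` is a list of (ranked_ids, relevant_ids) pairs, one per query."""
    return mean(reciprocal_rank(ranked, relevant) for ranked, relevant in runs)
```

Read this way, a median of 0.3333 means the typical query has its first relevant passage at rank 3, while a median of 1.0000 means at least half of the queries have a relevant passage at rank 1.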

MiniMSMARCO MRR plot

ESCI - Forcing more tool calls

Below we force the agent to make at least 4 calls to a retrieval backend. Repeat queries are rejected in one of two ways: exact equivalence (after normalization such as lowercasing) or semantic similarity to an earlier query.

| strategy | model | mean NDCG | median NDCG |
| --- | --- | --- | --- |
| agentic_bm25_e5_ecommerce_gpt5_mini | gpt-5-mini | 0.4101 | 0.3743 |
| agentic_bm25_e5_ecommerce_4calls_repeat_gpt5_mini | gpt-5-mini | 0.4290 | 0.3948 |
| agentic_bm25_e5_ecommerce_4calls_sim0p9_gpt5_mini | gpt-5-mini | 0.4236 | 0.4258 |
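The two repeat-query checks and the minimum-call rule could be sketched roughly as follows. All names here are hypothetical. The 0.9 threshold is a guess based on the `sim0p9` suffix in the strategy name, the exact normalization steps beyond lowercasing are assumed, and `embed` is any caller-supplied function mapping text to a vector.

```python
import math
import re


def normalize(query):
    """Lowercase and collapse whitespace (assumed normalization steps)."""
    return re.sub(r"\s+", " ", query.lower()).strip()


def is_exact_repeat(query, past_queries):
    """True if the query matches an earlier one after normalization."""
    return normalize(query) in {normalize(p) for p in past_queries}


def cosine(u, v):
    """Cosine similarity of two equal-length vectors; 0.0 if either is zero."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0


def is_semantic_repeat(query, past_queries, embed, threshold=0.9):
    """True if the query is too similar to any earlier query.

    The 0.9 default is an assumption inferred from the strategy name.
    """
    q_vec = embed(query)
    return any(cosine(q_vec, embed(p)) >= threshold for p in past_queries)


def may_finish(num_tool_calls, min_calls=4):
    """The agent may return final results only after enough tool calls."""
    return num_tool_calls >= min_calls
```

In a loop, a rejected query would typically be bounced back to the agent with a message asking it to try something different, and only accepted queries would count toward the minimum.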

Run a strategy

uv run run --strategy configs/ecom_base/agentic_ecom_minilm_gpt5_mini.yml --dataset wands --num-queries 1000

About

Various search experiments built on the cheat-at-search library.
