Critic-R: Improving Agentic Search using Instruction-tuned Retrievers with Natural Language Introspective Feedback
Agentic search systems iteratively interact with retrieval models to answer complex queries. Despite substantial progress, optimizing retrievers for agentic search remains challenging, often requiring heavy co-training or gold-standard annotations that limit real-world applicability. We propose Critic-R, a framework that explicitly closes the feedback loop between the reasoning agent and the retrieval model during both inference and training. Critic-R introduces a critic model that evaluates the agent's introspective reasoning trace after consuming retrieved evidence to determine whether the retrieved context sufficiently supports the next reasoning step. Critic-R has two complementary mechanisms: Critic-R-Zero, an inference-time query refinement loop that iteratively rewrites queries and retrieval instructions, and Critic-Embed, an optimization approach for retrieval models that leverages successful and failed refinement trajectories as automatic supervision without requiring manual relevance annotation. We evaluate Critic-R on HotpotQA, 2WikiMultihopQA, MuSiQue, and Bamboogle. Results show that Critic-R significantly improves both retrieval quality and downstream answer accuracy.
Critic-R requires Python 3.10+ and CUDA-capable GPUs. We recommend two separate conda environments — one for inference/serving and one for retriever training.
conda create -n sr1 python=3.10 -y
conda activate sr1
pip install torch==2.4.0 --index-url https://download.pytorch.org/whl/cu121
pip install vllm==0.6.3
pip install transformers==4.45.0 datasets accelerate
pip install aiohttp requests tqdm numpyconda create -n retriever python=3.10 -y
conda activate retriever
pip install torch==2.4.0 --index-url https://download.pytorch.org/whl/cu121
pip install transformers==4.45.0 sentence-transformers
pip install faiss-gpu==1.7.2
pip install deepspeed
pip install xformerscd search-r1/search
bash build_index_stella400M.shThis produces a Flat FAISS index over wiki-18.jsonl using mean-pooled Stella embeddings (max_length=256, fp16). Swap --faiss_type Flat for HNSW32/64/128 to use approximate search, or set --retrieval_method bm25 for a lexical baseline.
A full Critic-R inference run involves three services and one driver script:
- Critic LLM server (vLLM, OpenAI-compatible API).
- Dense retrieval server (FAISS over Wikipedia-18).
- Driver, which hosts the reasoning agent locally and orchestrates the loop.
# (1) Launch the critic LLM
sbatch vllm_server_job_multigpu.sh # -> http://<host>:8001/v1/chat/completions
# (2) Launch the retrieval server
sbatch retrieval_launch__stella400M.sh # -> http://<host>:8000/retrieve
# (3) Run the Critic-R loop
cd search-r1/critic-r
bash run.shIf you find this work useful, please cite:
@misc{alam2026criticrimprovingagenticsearch,
title={Critic-R: Improving Agentic Search using Instruction-tuned Retrievers with Natural Language Introspective Feedback},
author={Md Zarif Ul Alam and Alireza Salemi and Hamed Zamani},
year={2026},
eprint={2606.00590},
archivePrefix={arXiv},
primaryClass={cs.IR},
url={https://arxiv.org/abs/2606.00590},
}