yale-nlp/SciRAG


SciRAG: Adaptive, Citation-Aware, and Outline-Guided Retrieval and Synthesis for Scientific Literature

Paper

📖 Overview

SciRAG Framework Introduction

Existing baseline approaches to scientific question answering follow a single-threaded iterative retrieval strategy, which limits their ability to address multi-faceted questions comprehensively and leads to incomplete coverage and poorly organized answers.

SciRAG addresses these limitations through a novel framework with three key capabilities:

  • Adaptive Retrieval: A Gap Critic mechanism automatically determines when additional retrieval is needed and uses tree-based query decomposition to enable parallel or sequential exploration of sub-questions, with citation graph expansion to enrich the retrieved paper set.
  • Symbolic Reasoning-Based Reranking: A three-step symbolic reasoning process analyzes paper relationships and contributions to intelligently rerank retrieved documents.
  • Outline-Guided Synthesis: Answers are synthesized through bottom-up aggregation along the query tree, guided by a structured outline to ensure comprehensive coverage and proper organization.
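The adaptive retrieval and outline-guided synthesis steps above can be sketched as a query tree that is expanded when a Gap Critic detects missing evidence, then aggregated bottom-up. The sketch below is a minimal, hypothetical illustration: the `gap_critic`, `decompose`, `retrieve`, and `synthesize` functions are placeholder stand-ins for the LLM-driven components in the actual pipeline, not the repository's API.

```python
from dataclasses import dataclass, field

@dataclass
class QueryNode:
    """A node in the query decomposition tree."""
    question: str
    children: list = field(default_factory=list)
    evidence: list = field(default_factory=list)

def gap_critic(node: QueryNode) -> bool:
    # Placeholder: in SciRAG an LLM judges whether evidence gaps remain;
    # here we simply trigger retrieval when no evidence is attached yet.
    return len(node.evidence) == 0

def decompose(question: str) -> list:
    # Placeholder tree-based decomposition: split a compound question
    # into sub-questions (the real system uses an LLM for this).
    return [q.strip() + "?" for q in question.rstrip("?").split(" and ")]

def retrieve(question: str) -> list:
    # Placeholder for dense retrieval + citation graph expansion.
    return [f"paper relevant to: {question}"]

def synthesize(node: QueryNode) -> str:
    # Bottom-up aggregation along the query tree: children's answers
    # are merged into the parent, following a structured outline.
    parts = [synthesize(child) for child in node.children] + node.evidence
    return f"[{node.question}] " + " | ".join(parts)

root = QueryNode("What causes X and how is it treated?")
if gap_critic(root):
    for sub in decompose(root.question):
        root.children.append(QueryNode(sub, evidence=retrieve(sub)))
answer = synthesize(root)
print(answer)
```

The point of the sketch is the control flow: retrieval is not a single loop but a tree whose leaves are explored (in parallel or sequentially) and whose partial answers are merged upward.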

SciRAG achieves strong performance across both long-form literature review tasks (ScholarQA, QASA) and short-form answer tasks (SciFact, PubMedQA), demonstrating superior answer quality compared to existing baselines.

🏗️ Framework

SciRAG Complete Framework

🚀 Quickstart

📦 Installation

Create and activate the conda environment:

conda env create -f environment.yml
conda activate scirag

Remember to configure your API keys:

  • Set your Semantic Scholar API key in web_search.py
  • Set your LLM API key in config.yaml

▶️ Running the Pipeline

For long-form answers (ScholarQA, QASA):

cd longans
python run.py --input_file test.jsonl --output_file testout.jsonl --model_name gpt4o

For short-form answers (SciFact, PubMedQA):

cd shortans/scifact  # or shortans/pubmed
python run.py --input_file test.jsonl --output_file testout.jsonl --model_name gpt4o
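Inputs are JSON Lines files, one record per line. The snippet below builds a minimal `test.jsonl`; the field names (`id`, `question`) are assumptions for illustration only, so check the corresponding `run.py` for the schema it actually expects.

```python
import json

# Hypothetical input records: "id" and "question" are assumed field
# names, not confirmed by the repository; inspect run.py for the
# real schema before running the pipeline.
examples = [
    {"id": "q1", "question": "What methods improve retrieval for scientific QA?"},
]

# JSON Lines format: one JSON object per line.
with open("test.jsonl", "w") as f:
    for example in examples:
        f.write(json.dumps(example) + "\n")
```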

🔍 Initial Dense Retrieval

Since dense retrieval from a large-scale corpus (45 million papers) can be resource-intensive, we recommend following the retrieval method from:

OpenScholar Retriever

This avoids the heavy resource cost of retrieval. Once retrieval is done, use its output file as the input to the SciRAG pipeline.

📊 Evaluation

For evaluation tools and benchmarks, please refer to:

ScholarQABench

✍️ Citation

If you find our work useful or inspiring, please consider citing us:

@misc{ding2025sciragadaptivecitationawareoutlineguided,
      title={SciRAG: Adaptive, Citation-Aware, and Outline-Guided Retrieval and Synthesis for Scientific Literature}, 
      author={Hang Ding and Yilun Zhao and Tiansheng Hu and Manasi Patwardhan and Arman Cohan},
      year={2025},
      eprint={2511.14362},
      archivePrefix={arXiv},
      primaryClass={cs.DL},
      url={https://arxiv.org/abs/2511.14362}, 
}
