yale-nlp/SciRAG


SciRAG: Adaptive, Citation-Aware, and Outline-Guided Retrieval and Synthesis for Scientific Literature

Paper

📖 Overview

SciRAG Framework Introduction

Existing baseline approaches to scientific question answering follow a single-threaded iterative retrieval strategy, which limits their ability to address multi-faceted questions comprehensively and leads to incomplete coverage and poorly organized answers.

SciRAG addresses these limitations through a novel framework with three key capabilities:

  • Adaptive Retrieval: A Gap Critic mechanism automatically determines when additional retrieval is needed and uses tree-based query decomposition to enable parallel or sequential exploration of sub-questions, with citation graph expansion to enrich the retrieved paper set.
  • Symbolic Reasoning-Based Reranking: A three-step symbolic reasoning process analyzes paper relationships and contributions to intelligently rerank retrieved documents.
  • Outline-Guided Synthesis: Answers are synthesized through bottom-up aggregation along the query tree, guided by a structured outline to ensure comprehensive coverage and proper organization.
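The adaptive retrieval and outline-guided synthesis steps above can be sketched as a query tree that is expanded when a Gap Critic detects missing evidence, then aggregated bottom-up. The sketch below is a minimal, hypothetical illustration: the `gap_critic`, `decompose`, `retrieve`, and `synthesize` functions are placeholder stand-ins for the LLM-driven components in the actual pipeline, not the repository's API.

```python
from dataclasses import dataclass, field

@dataclass
class QueryNode:
    """A node in the query decomposition tree."""
    question: str
    children: list = field(default_factory=list)
    evidence: list = field(default_factory=list)

def gap_critic(node: QueryNode) -> bool:
    # Placeholder: in SciRAG an LLM judges whether evidence gaps remain;
    # here we simply trigger retrieval when no evidence is attached yet.
    return len(node.evidence) == 0

def decompose(question: str) -> list:
    # Placeholder tree-based decomposition: split a compound question
    # into sub-questions (the real system uses an LLM for this).
    return [q.strip() + "?" for q in question.rstrip("?").split(" and ")]

def retrieve(question: str) -> list:
    # Placeholder for dense retrieval + citation graph expansion.
    return [f"paper relevant to: {question}"]

def synthesize(node: QueryNode) -> str:
    # Bottom-up aggregation along the query tree: children's answers
    # are merged into the parent, following a structured outline.
    parts = [synthesize(child) for child in node.children] + node.evidence
    return f"[{node.question}] " + " | ".join(parts)

root = QueryNode("What causes X and how is it treated?")
if gap_critic(root):
    for sub in decompose(root.question):
        root.children.append(QueryNode(sub, evidence=retrieve(sub)))
answer = synthesize(root)
print(answer)
```

The point of the sketch is the control flow: retrieval is not a single loop but a tree whose leaves are explored (in parallel or sequentially) and whose partial answers are merged upward.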

SciRAG achieves strong performance across both long-form literature review tasks (ScholarQA, QASA) and short-form answer tasks (SciFact, PubMedQA), demonstrating superior answer quality compared to existing baselines.

🏗️ Framework

SciRAG Complete Framework

🚀 Quickstart

📦 Installation

Create and activate the conda environment:

conda env create -f environment.yml
conda activate scirag

Remember to configure your API keys:

  • Set your Semantic Scholar API key in web_search.py
  • Set your LLM API key in config.yaml

▶️ Running the Pipeline

For long-form answers (ScholarQA, QASA):

cd longans
python run.py --input_file test.jsonl --output_file testout.jsonl --model_name gpt4o

For short-form answers (SciFact, PubMedQA):

cd shortans/scifact  # or shortans/pubmed
python run.py --input_file test.jsonl --output_file testout.jsonl --model_name gpt4o
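Inputs are JSON Lines files, one record per line. The snippet below builds a minimal `test.jsonl`; the field names (`id`, `question`) are assumptions for illustration only, so check the corresponding `run.py` for the schema it actually expects.

```python
import json

# Hypothetical input records: "id" and "question" are assumed field
# names, not confirmed by the repository; inspect run.py for the
# real schema before running the pipeline.
examples = [
    {"id": "q1", "question": "What methods improve retrieval for scientific QA?"},
]

# JSON Lines format: one JSON object per line.
with open("test.jsonl", "w") as f:
    for example in examples:
        f.write(json.dumps(example) + "\n")
```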

🔍 Initial Dense Retrieval

Since dense retrieval from a large-scale corpus (45 million papers) can be resource-intensive, we recommend following the retrieval method from:

OpenScholar Retriever

This avoids the heavy resource cost of retrieval. Once retrieval is done, use its output file as the input to the SciRAG pipeline.

📊 Evaluation

For evaluation tools and benchmarks, please refer to:

ScholarQABench

✍️ Citation

If you find our work useful or inspiring, please consider citing us:

@misc{ding2025sciragadaptivecitationawareoutlineguided,
      title={SciRAG: Adaptive, Citation-Aware, and Outline-Guided Retrieval and Synthesis for Scientific Literature}, 
      author={Hang Ding and Yilun Zhao and Tiansheng Hu and Manasi Patwardhan and Arman Cohan},
      year={2025},
      eprint={2511.14362},
      archivePrefix={arXiv},
      primaryClass={cs.DL},
      url={https://arxiv.org/abs/2511.14362}, 
}
