InstructRAG: Instructing Retrieval-Augmented Generation with Explicit Denoising

weizhepei/InstructRAG

InstructRAG

Instructing Retrieval-Augmented Generation with Explicit Denoising
[arXiv] [Website] [Model] [Dataset] [X Summary]

InstructRAG is a simple yet effective RAG framework that lets LMs explicitly denoise retrieved content by generating rationales, which improves verifiability and trustworthiness.

InstructRAG Key Features:

  • 🤖 Self-Synthesis: Leverage instruction-tuned LMs to generate their OWN supervision for denoising.
  • 🔌 Easy-to-Use: Support both in-context learning (ICL) and supervised fine-tuning (SFT).
  • 🚀 Effectiveness: Up to 8.3% better results across 5 benchmarks (Table 5).
  • 💪 Noise Robustness: Robust to increased noise ratios in various scenarios (Figure 3).
  • 🔁 Task Transferability: InstructRAG can also solve out-of-domain unseen tasks (Figure 4).

Please also see our paper and X summary for more details.
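
To make the idea of "denoising by generating rationales" concrete, here is a minimal sketch of how a prompt might ask the LM to first explain which retrieved documents are useful and then answer. The function name `build_denoising_prompt` and the instruction wording are hypothetical illustrations, not the template used in this repository; see the paper and code for the actual prompts.

```python
from typing import List


def build_denoising_prompt(question: str, documents: List[str]) -> str:
    """Assemble a prompt that asks for a rationale before the final answer.

    NOTE: the layout and wording here are assumptions for illustration only.
    """
    # Number the retrieved documents so the rationale can refer to them.
    numbered = "\n".join(
        f"Document [{i}]: {doc}" for i, doc in enumerate(documents, start=1)
    )
    # Hypothetical instruction: explain relevance first, then answer.
    instruction = (
        "Some of the documents above may be irrelevant or inaccurate. "
        "First explain which documents, if any, help answer the question, "
        "then give the final answer."
    )
    return f"{numbered}\n\nQuestion: {question}\n\n{instruction}"


example_prompt = build_denoising_prompt(
    "Who wrote Pride and Prejudice?",
    [
        "Pride and Prejudice is an 1813 novel by Jane Austen.",
        "The Eiffel Tower is located in Paris.",
    ],
)
print(example_prompt)
```

The resulting string can be sent to any instruction-tuned LM; the second document in the example plays the role of retrieval noise that the rationale is expected to dismiss.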

Installation

Run the following script to create a Python virtual environment and install all required packages.

bash setup.sh

Alternatively, you can create a conda environment directly from the provided configuration file.

conda env create -f environment.yml

Training Script

To train the model (i.e., InstructRAG-FT), activate the environment and run the following training script. The training config is set for 4×H100 80GB GPUs. You may need to adjust NUM_DEVICE and PER_DEVICE_BATCH_SIZE to match your hardware.

conda activate instrag
bash train.sh
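
If you change NUM_DEVICE or PER_DEVICE_BATCH_SIZE and want to keep the effective (global) batch size unchanged, the usual relation is global batch = number of devices × per-device batch × gradient accumulation steps. The sketch below only does this arithmetic. The global batch size of 128 is a placeholder, not the value used in train.sh, and whether train.sh exposes a gradient accumulation setting is something to check in the script itself.

```python
def gradient_accumulation_steps(
    target_global_batch: int, num_device: int, per_device_batch_size: int
) -> int:
    """Return the accumulation steps needed to reach the target global batch."""
    samples_per_step = num_device * per_device_batch_size
    if samples_per_step <= 0:
        raise ValueError("num_device and per_device_batch_size must be positive")
    if target_global_batch % samples_per_step != 0:
        raise ValueError(
            "target_global_batch must be divisible by "
            "num_device * per_device_batch_size"
        )
    return target_global_batch // samples_per_step


# Placeholder numbers: the same global batch on 4 GPUs and on 2 GPUs.
print(gradient_accumulation_steps(128, num_device=4, per_device_batch_size=8))
print(gradient_accumulation_steps(128, num_device=2, per_device_batch_size=4))
```

With fewer or smaller GPUs, lowering the per-device batch and raising the accumulation steps keeps the optimization setup comparable at the cost of longer wall-clock time.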

Evaluation

There are two instantiations of our framework:

  • InstructRAG-ICL: training-free & easy-to-adapt
  • InstructRAG-FT: trainable & better performance

Use the following script to evaluate InstructRAG in both training-free and trainable settings. You can specify the task and model by adjusting DATASET and MODEL in eval.sh.

conda activate instrag
bash eval.sh

Generation Example

The following case study shows that InstructRAG can effectively identify relevant information in noisy input and draw on its own knowledge to answer questions correctly when required. Red text denotes irrelevant or inaccurate model generations, while green text denotes content relevant to the question.

Model Checkpoints

Below is the full list of InstructRAG models fine-tuned on each dataset in our work.

| Dataset | HF Model Repo | Retriever |
| --- | --- | --- |
| PopQA | meng-lab/PopQA-InstructRAG-FT | Contriever |
| TriviaQA | meng-lab/TriviaQA-InstructRAG-FT | Contriever |
| Natural Questions | meng-lab/NaturalQuestions-InstructRAG-FT | DPR |
| ASQA | meng-lab/ASQA-InstructRAG-FT | GTR |
| 2WikiMultiHopQA | meng-lab/2WikiMultiHopQA-InstructRAG-FT | BM25 |
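
As a rough sketch, the checkpoints above could be loaded with the Hugging Face `transformers` Auto classes. This assumes each repo is a standard causal LM checkpoint compatible with `AutoModelForCausalLM`, which this README does not state explicitly; the `CHECKPOINTS` mapping and `load_checkpoint` helper are illustrative names, not part of the repository.

```python
# Dataset name -> (Hugging Face repo, retriever used for that dataset),
# copied from the list of checkpoints above.
CHECKPOINTS = {
    "PopQA": ("meng-lab/PopQA-InstructRAG-FT", "Contriever"),
    "TriviaQA": ("meng-lab/TriviaQA-InstructRAG-FT", "Contriever"),
    "Natural Questions": ("meng-lab/NaturalQuestions-InstructRAG-FT", "DPR"),
    "ASQA": ("meng-lab/ASQA-InstructRAG-FT", "GTR"),
    "2WikiMultiHopQA": ("meng-lab/2WikiMultiHopQA-InstructRAG-FT", "BM25"),
}


def load_checkpoint(dataset: str):
    """Download and return (tokenizer, model) for the given dataset's checkpoint.

    Assumes the checkpoint loads as a causal LM; requires `transformers`
    and a backend such as PyTorch to be installed.
    """
    # Imported lazily so the mapping can be used without transformers installed.
    from transformers import AutoModelForCausalLM, AutoTokenizer

    repo_id, _retriever = CHECKPOINTS[dataset]
    tokenizer = AutoTokenizer.from_pretrained(repo_id)
    model = AutoModelForCausalLM.from_pretrained(repo_id)
    return tokenizer, model


# Example (downloads several GB of weights, so it is left commented out):
# tokenizer, model = load_checkpoint("PopQA")
```

Each checkpoint was trained with documents from the retriever listed beside it, so pairing a model with the same retriever at inference time is the safest default.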

Bugs or Questions?

If you have any questions related to the code or the paper, feel free to email Zhepei (zhepei.wei@virginia.edu). If you encounter any problems when using the code, or want to report a bug, feel free to open an issue! Please describe the problem in detail so we can help you better and faster.

Citation

Please cite our paper if you find the repo helpful in your work:

@article{wei2024instructrag,
  title={{InstructRAG}: Instructing Retrieval-Augmented Generation with Explicit Denoising},
  author={Wei, Zhepei and Chen, Wei-Lin and Meng, Yu},
  year={2024}
}
