ScholarCopilot

| 🚀Project Page | 📖Paper | 🤗Data | 🤗Model | 🤗Demo |

Scholar Copilot is an intelligent academic writing assistant that enhances the research writing process through AI-powered text completion and citation suggestions. Built by TIGER-Lab, it aims to streamline academic writing while maintaining high scholarly standards.

🌟 Key Features

📝 Smart Text Generation

  • Next-3-Sentence Suggestions: Get contextually relevant suggestions for your next three sentences
  • Full Section Auto-Completion: Generate complete sections with appropriate academic structure and flow
  • Context-Aware Writing: All generations consider your existing text to maintain coherence

📚 Intelligent Citation Management

  • Real-time Citation Suggestions: Receive relevant paper citations based on your writing context
  • One-Click Citation Insertion: Easily select and insert citations in proper academic format
  • Citation BibTeX Generation: Automatically generate and export BibTeX entries for your citations

Inference Pipeline Overview

Scholar Copilot employs a unified model architecture that seamlessly integrates retrieval and generation through a dynamic switching mechanism. During the generation process, the model autonomously determines appropriate citation points using learned citation patterns. When a citation is deemed necessary, the model temporarily halts generation, utilizes the hidden states of the citation token to retrieve relevant papers from the corpus, inserts the selected references, and then resumes coherent text generation.
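
The switching loop described above can be sketched in a few lines. This is a toy illustration, not the ScholarCopilot implementation: the `<cite>` marker, the scripted token stream standing in for the model, the two-dimensional vectors, and the corpus keys `paper_a` and `paper_b` are all hypothetical, and retrieval here is a brute-force dot-product search rather than an index lookup.

```python
from typing import Dict, List, Optional, Tuple

CITE = "<cite>"  # hypothetical marker for the citation token

# Hypothetical corpus: citation key -> embedding vector.
CORPUS: Dict[str, List[float]] = {
    "paper_a": [1.0, 0.0],
    "paper_b": [0.0, 1.0],
}

# Scripted stand-in for the language model. Each step yields a token and,
# for the citation token only, a "hidden state" used as the retrieval query.
SCRIPT: List[Tuple[str, Optional[List[float]]]] = [
    ("Dense", None),
    ("retrieval", None),
    (CITE, [0.9, 0.1]),
    ("improves", None),
    ("generation", None),
    (CITE, [0.2, 0.8]),
    (".", None),
]


def dot(a: List[float], b: List[float]) -> float:
    """Inner product of two equal-length vectors."""
    return sum(x * y for x, y in zip(a, b))


def retrieve(query: List[float], corpus: Dict[str, List[float]]) -> str:
    """Return the key of the corpus entry with the highest inner product."""
    return max(corpus, key=lambda key: dot(query, corpus[key]))


def generate(script: List[Tuple[str, Optional[List[float]]]],
             corpus: Dict[str, List[float]]) -> str:
    """Walk the token stream, pausing at each citation token to retrieve."""
    output: List[str] = []
    for token, hidden in script:
        if token == CITE and hidden is not None:
            # Generation pauses: the hidden state is used as the query,
            # the best match is inserted, and generation then resumes.
            output.append(f"[{retrieve(hidden, corpus)}]")
        else:
            output.append(token)
    return " ".join(output)


if __name__ == "__main__":
    print(generate(SCRIPT, CORPUS))
```

In the real system the query would come from the model's hidden state at the citation token and the search would run over the full paper corpus; the sketch only shows the control flow of pausing, retrieving, inserting, and resuming.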

🚀 Getting Started

To set up the ScholarCopilot demo on your own server, follow these simple steps:

  1. Clone the repository:
git clone git@github.com:TIGER-AI-Lab/ScholarCopilot.git
cd ScholarCopilot/run_demo
  2. Set up the environment:
pip install -r requirements.txt
  3. Download the required model and data:
bash download.sh
  4. Launch the demo:
bash run_demo.sh

Add new papers to the corpus

To update your corpus with the latest papers, follow these steps:

  1. Download the most recent arXiv metadata from Kaggle and save it to your chosen ARXIV_META_DATA_PATH.
  2. Run the data processing script:
cd utils/
python process_arxiv_meta_data.py ARXIV_META_DATA_PATH ../data/corpus_data_arxiv_1215.jsonl
  3. Generate embeddings for the corpus:
bash encode_corpus.sh
  4. Convert the embeddings to an HNSW index for efficient search:
python build_hnsw_index.py --input_dir <embedding dir> --output_dir <hnsw index dir>
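
As a rough picture of the data-processing step, the sketch below turns arXiv metadata records (one JSON object per line, as in the Kaggle snapshot) into JSONL corpus rows. It is not the repository's `process_arxiv_meta_data.py`: the function name `convert_metadata` and the output fields `paper_id`, `title`, and `abstract` are assumptions made for illustration, and the real script may keep different fields or apply different filters.

```python
import json
from typing import Iterable, Iterator


def convert_metadata(lines: Iterable[str]) -> Iterator[str]:
    """Yield one JSON string per usable metadata record.

    The output schema (paper_id, title, abstract) is hypothetical.
    """
    for line in lines:
        line = line.strip()
        if not line:
            continue  # skip blank lines
        record = json.loads(line)
        # arXiv titles and abstracts often contain hard line breaks;
        # collapse all runs of whitespace into single spaces.
        title = " ".join(record.get("title", "").split())
        abstract = " ".join(record.get("abstract", "").split())
        if not title or not abstract:
            continue  # drop records that cannot be embedded meaningfully
        yield json.dumps(
            {"paper_id": record.get("id"), "title": title, "abstract": abstract}
        )


if __name__ == "__main__":
    # Made-up sample record, for illustration only.
    sample = ['{"id": "0000.00001", "title": "A toy\\n title", "abstract": "Some text."}']
    for row in convert_metadata(sample):
        print(row)
```

Streaming line by line keeps memory use flat, which matters because the full metadata snapshot is a large file.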

📖 Demo Video

Scholar Copilot Demo Video

Train your own model

  1. Download the training data:
cd train/
bash download.sh
  2. Configure and run the training script. To reproduce our results, use the hyperparameters in the script on 4 machines with 8 GPUs each (32 GPUs in total):
cd src/
bash start_train.sh

Citation

@article{wang2024scholarcopilot,
  title={ScholarCopilot: Training Large Language Models for Academic Writing with Accurate Citations},
  author = {Wang, Yubo and Ma, Xueguang and Nie, Ping and Zeng, Huaye and Lyu, Zhiheng and Zhang, Yuxuan and Schneider, Benjamin and Lu, Yi and Yue, Xiang and Chen, Wenhu},
  journal={arXiv preprint arXiv:2504.00824},
  year={2025}
}
