ScholarCopilot

ScholarCopilot is an academic writing assistant that supports research writing with AI-powered text completion and citation suggestions. Built by TIGER-Lab, it aims to streamline academic writing while maintaining high scholarly standards.

🌟 Key Features

📝 Smart Text Generation

Next-3-Sentence Suggestions: Get contextually relevant suggestions for your next three sentences
Full Section Auto-Completion: Generate complete sections with appropriate academic structure and flow
Context-Aware Writing: All generations consider your existing text to maintain coherence

📚 Intelligent Citation Management

Real-time Citation Suggestions: Receive relevant paper citations based on your writing context
One-Click Citation Insertion: Easily select and insert citations in proper academic format
Citation BibTeX Generation: Automatically generate and export BibTeX entries for your citations
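
As a rough illustration of what BibTeX generation involves, the sketch below formats one `@article` entry from a handful of metadata fields. It is not ScholarCopilot's implementation: the function name `to_bibtex`, its parameters, and the sample values are all hypothetical.

```python
def to_bibtex(key, title, authors, year, journal):
    """Format one @article entry; authors is a list of 'Last, First' strings."""
    # Hypothetical field set; a real exporter may emit more fields.
    fields = {
        "title": title,
        "author": " and ".join(authors),  # BibTeX separates authors with " and "
        "journal": journal,
        "year": str(year),
    }
    body = ",\n".join(f"  {name}={{{value}}}" for name, value in fields.items())
    return f"@article{{{key},\n{body}\n}}"


# Made-up sample values, for illustration only.
entry = to_bibtex(
    "doe2020example",
    "An Example",
    ["Doe, Jane", "Roe, Richard"],
    2020,
    "Hypothetical Journal",
)
print(entry)
```

Keeping the fields in a dictionary makes it easy to add or drop fields without touching the formatting logic.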

Inference Pipeline Overview

ScholarCopilot uses a single model for both retrieval and generation and switches between the two dynamically. During generation, the model decides where a citation is needed based on learned citation patterns. When a citation is needed, the model pauses generation, uses the hidden states of the citation token to retrieve relevant papers from the corpus, inserts the selected references, and then resumes generating text.
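
The control flow described above can be sketched as a toy loop. This is a simplified illustration under stated assumptions, not the project's code. The marker `<cite>`, the function names, the cosine-similarity scoring, and the two-dimensional vectors are all hypothetical. A scripted list of (token, hidden state) pairs stands in for the model's decoding steps, so the sketch does not show later tokens being conditioned on the inserted references.

```python
import math

CITE = "<cite>"  # hypothetical marker; the real citation token is not named here


def cosine(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


def retrieve(query_vec, corpus, top_k=1):
    """Return the ids of the top_k corpus entries most similar to query_vec."""
    ranked = sorted(corpus, key=lambda pid: cosine(query_vec, corpus[pid]), reverse=True)
    return ranked[:top_k]


def generate_with_citations(steps, corpus):
    """Walk scripted decoding steps, switching to retrieval at each citation marker."""
    output = []
    for token, hidden in steps:
        if token == CITE:
            # Pause generation: use the marker's hidden state as the retrieval query,
            # then insert the retrieved reference and continue with the next step.
            for pid in retrieve(hidden, corpus):
                output.append(f"[{pid}]")
        else:
            output.append(token)
    return " ".join(output)


# Toy corpus of paper embeddings and a scripted decoding trace (all made up).
corpus = {"paperA": [1.0, 0.0], "paperB": [0.0, 1.0]}
steps = [
    ("Transformers", None),
    ("scale", None),
    ("well", None),
    (CITE, [0.1, 0.9]),
    (".", None),
]
print(generate_with_citations(steps, corpus))
```

In this toy trace the query vector points mostly along the second axis, so `paperB` is the reference inserted at the citation point.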

🚀 Getting Started

To set up the ScholarCopilot demo on your own server, follow these simple steps:

Clone the repository:

git clone git@github.com:TIGER-AI-Lab/ScholarCopilot.git
cd ScholarCopilot/run_demo

Set up the environment:

pip install -r requirements.txt

Download the required model and data:

bash download.sh

Launch the demo:

bash run_demo.sh

Add new papers to the corpus

To update your corpus with the latest papers, follow these steps:

Download the most recent arXiv metadata from Kaggle and save it to your chosen ARXIV_META_DATA_PATH
Run the data processing script:

cd utils/
python process_arxiv_meta_data.py ARXIV_META_DATA_PATH ../data/corpus_data_arxiv_1215.jsonl

Generate embeddings for the corpus:

bash encode_corpus.sh

Convert the embeddings to an HNSW index for efficient search:

python build_hnsw_index.py --input_dir <embedding dir> --output_dir <hnsw index dir>
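
An HNSW index trades a small amount of accuracy for much faster lookups than scoring every vector. One way to sanity-check a freshly built index is to compare its results with an exact brute-force search on a small sample. The sketch below shows that baseline and a recall measure. It assumes inner-product similarity and in-memory embeddings keyed by paper id. Both are assumptions, as are the function names, which are not part of the repository.

```python
def inner_product(a, b):
    """Dot product of two equal-length vectors."""
    return sum(x * y for x, y in zip(a, b))


def exact_top_k(query, embeddings, k=3):
    """Brute-force search: score every vector, return the k best (id, score) pairs."""
    scored = [(pid, inner_product(query, vec)) for pid, vec in embeddings.items()]
    scored.sort(key=lambda item: item[1], reverse=True)
    return scored[:k]


def recall_at_k(approx_ids, exact_ids):
    """Fraction of the exact neighbours that the approximate index also returned."""
    exact = set(exact_ids)
    return len(exact & set(approx_ids)) / len(exact) if exact else 1.0


# Made-up embeddings; approx_ids stands in for ids returned by an HNSW query.
embeddings = {"a": [1.0, 0.0], "b": [0.5, 0.5], "c": [0.0, 1.0]}
exact_ids = [pid for pid, _ in exact_top_k([1.0, 0.0], embeddings, k=2)]
approx_ids = ["a", "c"]
print(exact_ids, recall_at_k(approx_ids, exact_ids))
```

Brute-force search is linear in corpus size, so it is only practical as a check on a few sample queries, not as a replacement for the index.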

📖 Demo Video

Train your own model

Download the training data:

cd train/
bash download.sh

Configure and run the training script. To reproduce our results, use the hyperparameters in the script and 4 machines with 8 GPUs each (32 GPUs in total):

cd src/
bash start_train.sh

Citation

@article{wang2024scholarcopilot,
  title={ScholarCopilot: Training Large Language Models for Academic Writing with Accurate Citations},
  author = {Wang, Yubo and Ma, Xueguang and Nie, Ping and Zeng, Huaye and Lyu, Zhiheng and Zhang, Yuxuan and Schneider, Benjamin and Lu, Yi and Yue, Xiang and Chen, Wenhu},
  journal={arXiv preprint arXiv:2504.00824},
  year={2025}
}
