Skip to content

IAAR-Shanghai/SurveyX

Folders and files

NameName
Last commit message
Last commit date

Latest commit

 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 

Repository files navigation

SurveyX: Academic Survey Automation via Large Language Models

✨Welcome to SurveyX! If you want to experience the full features, please log in to our website. This open-source code only provides offline processing capabilities.✨
arxiv paper surveyx.cn huggingface paper github stars last commit
Wechat Group

If you find our work helpful, don't forget to give us a star! ⭐️
👉 Visit SurveyX 👈

[English | 中文]

🤔What is SurveyX?

surveyx_frame

SurveyX is an advanced academic survey automation system that leverages the power of Large Language Models (LLMs) to generate high-quality, domain-specific academic papers and surveys. By simply providing a paper title and keywords for literature retrieval, users can request comprehensive academic papers or surveys tailored to specific topics.


🆚 Full Version vs. Offline Open Source Version

The open-source code in this repository only provides offline processing capabilities. If you want to experience the full features, please log in to our website.

Missing features in the open-source version:

  1. Real-time online search: You can only generate surveys based on your own uploaded .md format references. The open-source version lacks access to our paper database, web crawler system, keyword expansion algorithms, and dual-layer semantic filtering for literature acquisition.
  2. Multimodal document parsing: The generated survey will not include image understanding or illustrations from the references.

🛠️ How to Use the Offline Open Source Version (This repo)

1. Prerequisites

  • Python 3.10+ (Anaconda recommended)
  • All Python dependencies in requirements.txt
  • LaTeX environment (for PDF compilation):
  • You need to convert all your reference documents to Markdown (.md) format and put them together in a single folder before running the pipeline.
sudo apt update && sudo apt install texlive-full

2. Installation

  1. Clone the repository:
git clone https://github.com/IAAR-Shanghai/SurveyX.git
cd SurveyX
  1. Install Python dependencies:
pip install -r requirements.txt

3. LLM Configuration

Edit src/configs/config.py to provide your LLM API URL, token, and model information before running the pipeline.

Example:

REMOTE_URL = "https://api.openai.com/v1/chat/completions"
TOKEN = "sk-xxxx..."
DEFAULT_EMBED_ONLINE_MODEL = "BAAI/bge-base-en-v1.5"
EMBED_REMOTE_URL = "https://api.siliconflow.cn/v1/embeddings"
EMBED_TOKEN = "your embed token here"

4. Workflow

Each run creates a unique result folder under outputs/, named by the task id outputs/<task_id> (e.g., outputs/2025-06-18-0935_keyword/).

Run the full pipeline:

python tasks/offline_run.py --title "Your Survey Title" --key_words "keyword1, keyword2, ..." --ref_path "path/to/your/reference/dir"

Or run step by step:

export task_id="your_task_id"
python tasks/workflow/03_gen_outlines.py --task_id $task_id
python tasks/workflow/04_gen_content.py --task_id $task_id
python tasks/workflow/05_post_refine.py --task_id $task_id
python tasks/workflow/06_gen_latex.py --task_id $task_id

Note: Your local reference documents must be in Markdown (.md) format and placed in a single directory.

5. Output

  • All results are saved under outputs/<task_id>/
    • survey.pdf: Final compiled survey
    • outlines.json: Generated outline
    • latex/: LaTeX sources
    • tmp/: Intermediate files

Example Papers

Title Keywords
A Survey of NoSQL Database Systems for Flexible and Scalable Data Management NoSQL, Database Systems, Flexibility, Scalability, Data Management
Vector Databases and Their Role in Modern Data Management and Retrieval A Survey Vector Databases, Data Management, Data Retrieval, Modern Applications
Graph Databases A Survey on Models, Data Modeling, and Applications Graph Databases, Data Modeling
A Survey on Large Language Model Integration with Databases for Enhanced Data Management and Survey Analysis Large Language Models, Database Integration, Data Management, Survey Analysis, Enhanced Processing
A Survey of Temporal Databases Real-Time Databases and Data Management Systems Temporal Databases, Real-Time Databases, Data Management
From BERT to GPT-4: A Survey of Architectural Innovations in Pre-trained Language Models Transformer, BERT, GPT-3, self-attention, masked language modeling, cross-lingual transfer, model scaling
Unsupervised Cross-Lingual Word Embedding Alignment: Techniques and Applications low-resource NLP, few-shot learning, data augmentation, unsupervised alignment, synthetic corpora, NLLB, zero-shot transfer
Vision-Language Pre-training: Architectures, Benchmarks, and Emerging Trends multimodal learning, CLIP, Whisper, cross-modal retrieval, modality fusion, video-language models, contrastive learning
Efficient NLP at Scale: A Review of Model Compression Techniques model compression, knowledge distillation, pruning, quantization, TinyBERT, edge computing, latency-accuracy tradeoff
Domain-Specific NLP: Adapting Models for Healthcare, Law, and Finance domain adaptation, BioBERT, legal NLP, clinical text analysis, privacy-preserving NLP, terminology extraction, few-shot domain transfer
Attention Heads of Large Language Models: A Survey attention head, attention mechanism, large language model, LLM,transformer architecture, neural networks, natural language processing
Controllable Text Generation for Large Language Models: A Survey controlled text generation, text generation, large language model, LLM,natural language processing
A survey on evaluation of large language models evaluation of large language models,large language models assessment, natural language processing, AI model evaluation
Large language models for generative information extraction: a survey information extraction, large language models, LLM,natural language processing, generative AI, text mining
Internal consistency and self feedback of LLM Internal consistency, self feedback, large language model, LLM,natural language processing, model evaluation, AI reliability
Review of Multi Agent Offline Reinforcement Learning multi agent, offline policy, reinforcement learning,decentralized learning, cooperative agents, policy optimization
Reasoning of large language model: A survey reasoning of large language models, large language models, LLM,natural language processing, AI reasoning, transformer models
Hierarchy Theorems in Computational Complexity: From Time-Space Tradeoffs to Oracle Separations P vs NP, NP-completeness, polynomial hierarchy, space complexity, oracle separation, Cook-Levin theorem
Classical Simulation of Quantum Circuits: Complexity Barriers and Implications BQP, quantum supremacy, Shor's algorithm, post-quantum cryptography, QMA, hidden subgroup problem
Kernelization: Theory, Techniques, and Limits fixed-parameter tractable (FPT), kernelization, treewidth, W-hierarchy, ETH (Exponential Time Hypothesis), parameterized reduction
Optimal Inapproximability Thresholds for Combinatorial Optimization Problems PCP theorem, approximation ratio, Unique Games Conjecture, APX-hardness, gap-preserving reduction, LP relaxation
Hardness in P: When Polynomial Time is Not Enough SETH (Strong Exponential Time Hypothesis), 3SUM conjecture, all-pairs shortest paths (APSP), orthogonal vectors problem, fine-grained reduction, dynamic lower bounds
Consistency Models in Distributed Databases: From ACID to NewSQL CAP theorem, ACID vs BASE, Paxos/Raft, Spanner, NewSQL, sharding, linearizability
Cloud-Native Databases: Architectures, Challenges, and Future Directions cloud databases, AWS Aurora, Snowflake, storage-compute separation, auto-scaling, pay-per-query, multi-tenancy
Graph Database Systems: Storage Engines and Query Optimization Techniques graph traversal, Neo4j, SPARQL, property graph, subgraph matching, RDF triplestore, Gremlin
Real-Time Aggregation in TSDBs: Techniques for High-Cardinality Data time-series data, InfluxDB, Prometheus, downsampling, time windowing, high-cardinality indexing, stream processing
Self-Driving Databases: A Survey of AI-Powered Autonomous Management autonomous databases, learned indexes, query optimization, Oracle AutoML, workload forecasting, anomaly detection
Multi-Model Databases: Integrating Relational, Document, and Graph Paradigms multi-model database, MongoDB, ArangoDB, JSONB, unified query language, schema flexibility, polystore
Vector Databases for AI: Efficient Similarity Search and Retrieval-Augmented Generation vector database, FAISS, Milvus, ANN search, embedding indexing, RAG (Retrieval-Augmented Generation), HNSW
Software-Defined Networking: Evolution, Challenges, and Future Scalability OpenFlow, control plane/data plane separation, NFV orchestration, network slicing, P4 language, OpenDaylight, scalability bottlenecks
Beyond 5G: Architectural Innovations for Terahertz Communication and Network Slicing network slicing, MEC (Multi-access Edge Computing), beamforming, mmWave, URLLC (Ultra-Reliable Low-Latency Communication), O-RAN, energy efficiency
IoT Network Protocols: A Comparative Study of LoRaWAN, NB-IoT, and Thread LPWAN, LoRa, ZigBee 3.0, 6LoWPAN, TDMA scheduling, RPL routing, device density management
Edge Caching in Content Delivery Networks: Algorithms and Economic Incentives CDN, Akamai, cache replacement policies, DASH (Dynamic Adaptive Streaming), QoE optimization, edge server placement, bandwidth cost reduction
A survey on flow batteries battery electrolyte formulation
Research on battery electrolyte formulation flow batteries

📃Citing SurveyX

Please cite us if you find this project helpful for your project/paper:

@misc{liang2025surveyxacademicsurveyautomation,
      title={SurveyX: Academic Survey Automation via Large Language Models}, 
      author={Xun Liang and Jiawei Yang and Yezhaohui Wang and Chen Tang and Zifan Zheng and Shichao Song and Zehao Lin and Yebin Yang and Simin Niu and Hanyu Wang and Bo Tang and Feiyu Xiong and Keming Mao and Zhiyu li},
      year={2025},
      eprint={2502.14776},
      archivePrefix={arXiv},
      primaryClass={cs.CL},
      url={https://arxiv.org/abs/2502.14776}, 
}

Open Source Version Notice

This open source version of Surveyx is a simplified edition. It relies entirely on user-provided local reference documents and does not include advanced features such as:

  • Keyword expansion and filtering algorithms
  • Multimodal image parsing or figure extraction
  • Online reference search or automatic data fetching

These advanced modules are only available in the full version of Surveyx, which is hosted by MemTensor (Shanghai) Technology Co., Ltd. If you would like to experience the complete features, please visit our official website: surveyx.cn

For questions or issues, please open an issue on the repository.

⚠️ Disclaimer

SurveyX uses advanced language models to assist with the generation of academic papers. However, it is important to note that the generated content is a tool for research assistance. Users should verify the accuracy of the generated papers, as SurveyX cannot guarantee full compliance with academic standards.