✨Welcome to SurveyX! If you want to experience the full features, please log in to our website. This open-source code only provides offline processing capabilities.✨
👉 Visit SurveyX 👈
[English | 中文]
SurveyX is an advanced academic survey automation system that leverages the power of Large Language Models (LLMs) to generate high-quality, domain-specific academic papers and surveys. By simply providing a paper title and keywords for literature retrieval, users can request comprehensive academic papers or surveys tailored to specific topics.
The open-source code in this repository only provides offline processing capabilities. If you want to experience the full features, please log in to our website.
Missing features in the open-source version:
- Real-time online search: You can only generate surveys from your own uploaded `.md` references. The open-source version lacks access to our paper database, web crawler system, keyword expansion algorithms, and dual-layer semantic filtering for literature acquisition.
- Multimodal document parsing: The generated survey will not include image understanding or illustrations drawn from the references.
Prerequisites:
- Python 3.10+ (Anaconda recommended)
- All Python dependencies in `requirements.txt`
- LaTeX environment (for PDF compilation):
  ```bash
  sudo apt update && sudo apt install texlive-full
  ```
- You need to convert all your reference documents to Markdown (`.md`) format and put them together in a single folder before running the pipeline (a minimal validation sketch follows this list).
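Before launching the pipeline, it can help to sanity-check that folder. The snippet below is a minimal sketch: the path is a placeholder, and the only assumption is what this README already states, namely that the pipeline reads non-empty `.md` files from a single directory.

```python
# Sanity-check a reference folder before running SurveyX offline.
# Minimal sketch: the path below is a placeholder; point it at your own folder.
from pathlib import Path

ref_dir = Path("path/to/your/reference/dir")  # placeholder path

if not ref_dir.is_dir():
    raise SystemExit(f"Reference directory not found: {ref_dir}")

md_files = sorted(ref_dir.glob("*.md"))
other = [p.name for p in ref_dir.iterdir() if p.is_file() and p.suffix != ".md"]

if not md_files:
    raise SystemExit(f"No .md references found in {ref_dir}")
for p in md_files:
    if p.stat().st_size == 0:
        print(f"Warning: empty reference file: {p.name}")
if other:
    print(f"Warning: non-Markdown files present (the pipeline expects .md): {other}")
print(f"{len(md_files)} Markdown references found in {ref_dir}")
```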
Installation:
- Clone the repository:
  ```bash
  git clone https://github.com/IAAR-Shanghai/SurveyX.git
  cd SurveyX
  ```
- Install Python dependencies:
  ```bash
  pip install -r requirements.txt
  ```
Configuration: Edit `src/configs/config.py` to provide your LLM API URL, token, and model information before running the pipeline. Example:

```python
REMOTE_URL = "https://api.openai.com/v1/chat/completions"
TOKEN = "sk-xxxx..."
DEFAULT_EMBED_ONLINE_MODEL = "BAAI/bge-base-en-v1.5"
EMBED_REMOTE_URL = "https://api.siliconflow.cn/v1/embeddings"
EMBED_TOKEN = "your embed token here"
```
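To confirm the credentials work before starting a long run, you can send a one-off request to the configured endpoint. This is a minimal sketch that assumes an OpenAI-compatible `/chat/completions` API; the model name is a placeholder and should match whatever you configured:

```python
# One-off connectivity check for the configured LLM endpoint.
# Sketch only: assumes an OpenAI-compatible /chat/completions API.
# REMOTE_URL and TOKEN mirror the config example above; MODEL is a placeholder.
import requests

REMOTE_URL = "https://api.openai.com/v1/chat/completions"
TOKEN = "sk-xxxx..."
MODEL = "your-model-name"  # placeholder

resp = requests.post(
    REMOTE_URL,
    headers={"Authorization": f"Bearer {TOKEN}"},
    json={
        "model": MODEL,
        "messages": [{"role": "user", "content": "ping"}],
        "max_tokens": 5,
    },
    timeout=30,
)
resp.raise_for_status()
print(resp.json()["choices"][0]["message"]["content"])
```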
Each run creates a unique result folder under `outputs/`, named by the task id: `outputs/<task_id>/` (e.g., `outputs/2025-06-18-0935_keyword/`).
Run the full pipeline:
```bash
python tasks/offline_run.py --title "Your Survey Title" --key_words "keyword1, keyword2, ..." --ref_path "path/to/your/reference/dir"
```
Or run step by step:
```bash
export task_id="your_task_id"
python tasks/workflow/03_gen_outlines.py --task_id $task_id
python tasks/workflow/04_gen_content.py --task_id $task_id
python tasks/workflow/05_post_refine.py --task_id $task_id
python tasks/workflow/06_gen_latex.py --task_id $task_id
```
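If you prefer to drive the individual steps from Python, a small wrapper can run them in order and stop on the first failure. This is a sketch only: the script names come from the commands above, while the task id value is a placeholder you must replace.

```python
# Run the offline workflow steps in sequence for one task id.
# Sketch: script names are taken from the commands above; TASK_ID is a
# placeholder and must match an existing outputs/<task_id> folder.
import subprocess
import sys

TASK_ID = "your_task_id"  # placeholder

STEPS = [
    "tasks/workflow/03_gen_outlines.py",
    "tasks/workflow/04_gen_content.py",
    "tasks/workflow/05_post_refine.py",
    "tasks/workflow/06_gen_latex.py",
]

for script in STEPS:
    print(f"Running {script} ...")
    result = subprocess.run([sys.executable, script, "--task_id", TASK_ID])
    if result.returncode != 0:
        sys.exit(f"{script} failed with exit code {result.returncode}")
```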
Note: Your local reference documents must be in Markdown (`.md`) format and placed in a single directory.
- All results are saved under `outputs/<task_id>/`:
  - `survey.pdf`: Final compiled survey
  - `outlines.json`: Generated outline
  - `latex/`: LaTeX sources
  - `tmp/`: Intermediate files
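As a quick way to check a finished run, the sketch below locates the most recently modified task folder under `outputs/` and reports which of the artifacts listed above are present (the folder layout follows this README):

```python
# List the documented artifacts of the most recent SurveyX run.
# Sketch: artifact names follow the output layout described above.
from pathlib import Path

outputs = Path("outputs")
if not outputs.is_dir():
    raise SystemExit("No outputs/ directory found; run the pipeline first")

task_dirs = [p for p in outputs.iterdir() if p.is_dir()]
if not task_dirs:
    raise SystemExit("No task folders found under outputs/")

latest = max(task_dirs, key=lambda p: p.stat().st_mtime)
print(f"Most recent task: {latest.name}")

for artifact in ("survey.pdf", "outlines.json", "latex", "tmp"):
    status = "found" if (latest / artifact).exists() else "missing"
    print(f"  {artifact}: {status}")
```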
Example survey titles and keywords:

| Title | Keywords |
| --- | --- |
A Survey of NoSQL Database Systems for Flexible and Scalable Data Management | NoSQL, Database Systems, Flexibility, Scalability, Data Management |
Vector Databases and Their Role in Modern Data Management and Retrieval A Survey | Vector Databases, Data Management, Data Retrieval, Modern Applications |
Graph Databases A Survey on Models, Data Modeling, and Applications | Graph Databases, Data Modeling |
A Survey on Large Language Model Integration with Databases for Enhanced Data Management and Survey Analysis | Large Language Models, Database Integration, Data Management, Survey Analysis, Enhanced Processing |
A Survey of Temporal Databases Real-Time Databases and Data Management Systems | Temporal Databases, Real-Time Databases, Data Management |
From BERT to GPT-4: A Survey of Architectural Innovations in Pre-trained Language Models | Transformer, BERT, GPT-3, self-attention, masked language modeling, cross-lingual transfer, model scaling |
Unsupervised Cross-Lingual Word Embedding Alignment: Techniques and Applications | low-resource NLP, few-shot learning, data augmentation, unsupervised alignment, synthetic corpora, NLLB, zero-shot transfer |
Vision-Language Pre-training: Architectures, Benchmarks, and Emerging Trends | multimodal learning, CLIP, Whisper, cross-modal retrieval, modality fusion, video-language models, contrastive learning |
Efficient NLP at Scale: A Review of Model Compression Techniques | model compression, knowledge distillation, pruning, quantization, TinyBERT, edge computing, latency-accuracy tradeoff |
Domain-Specific NLP: Adapting Models for Healthcare, Law, and Finance | domain adaptation, BioBERT, legal NLP, clinical text analysis, privacy-preserving NLP, terminology extraction, few-shot domain transfer |
Attention Heads of Large Language Models: A Survey | attention head, attention mechanism, large language model, LLM, transformer architecture, neural networks, natural language processing |
Controllable Text Generation for Large Language Models: A Survey | controlled text generation, text generation, large language model, LLM, natural language processing |
A survey on evaluation of large language models | evaluation of large language models, large language models assessment, natural language processing, AI model evaluation |
Large language models for generative information extraction: a survey | information extraction, large language models, LLM, natural language processing, generative AI, text mining |
Internal consistency and self feedback of LLM | Internal consistency, self feedback, large language model, LLM, natural language processing, model evaluation, AI reliability |
Review of Multi Agent Offline Reinforcement Learning | multi agent, offline policy, reinforcement learning, decentralized learning, cooperative agents, policy optimization |
Reasoning of large language model: A survey | reasoning of large language models, large language models, LLM, natural language processing, AI reasoning, transformer models |
Hierarchy Theorems in Computational Complexity: From Time-Space Tradeoffs to Oracle Separations | P vs NP, NP-completeness, polynomial hierarchy, space complexity, oracle separation, Cook-Levin theorem |
Classical Simulation of Quantum Circuits: Complexity Barriers and Implications | BQP, quantum supremacy, Shor's algorithm, post-quantum cryptography, QMA, hidden subgroup problem |
Kernelization: Theory, Techniques, and Limits | fixed-parameter tractable (FPT), kernelization, treewidth, W-hierarchy, ETH (Exponential Time Hypothesis), parameterized reduction |
Optimal Inapproximability Thresholds for Combinatorial Optimization Problems | PCP theorem, approximation ratio, Unique Games Conjecture, APX-hardness, gap-preserving reduction, LP relaxation |
Hardness in P: When Polynomial Time is Not Enough | SETH (Strong Exponential Time Hypothesis), 3SUM conjecture, all-pairs shortest paths (APSP), orthogonal vectors problem, fine-grained reduction, dynamic lower bounds |
Consistency Models in Distributed Databases: From ACID to NewSQL | CAP theorem, ACID vs BASE, Paxos/Raft, Spanner, NewSQL, sharding, linearizability |
Cloud-Native Databases: Architectures, Challenges, and Future Directions | cloud databases, AWS Aurora, Snowflake, storage-compute separation, auto-scaling, pay-per-query, multi-tenancy |
Graph Database Systems: Storage Engines and Query Optimization Techniques | graph traversal, Neo4j, SPARQL, property graph, subgraph matching, RDF triplestore, Gremlin |
Real-Time Aggregation in TSDBs: Techniques for High-Cardinality Data | time-series data, InfluxDB, Prometheus, downsampling, time windowing, high-cardinality indexing, stream processing |
Self-Driving Databases: A Survey of AI-Powered Autonomous Management | autonomous databases, learned indexes, query optimization, Oracle AutoML, workload forecasting, anomaly detection |
Multi-Model Databases: Integrating Relational, Document, and Graph Paradigms | multi-model database, MongoDB, ArangoDB, JSONB, unified query language, schema flexibility, polystore |
Vector Databases for AI: Efficient Similarity Search and Retrieval-Augmented Generation | vector database, FAISS, Milvus, ANN search, embedding indexing, RAG (Retrieval-Augmented Generation), HNSW |
Software-Defined Networking: Evolution, Challenges, and Future Scalability | OpenFlow, control plane/data plane separation, NFV orchestration, network slicing, P4 language, OpenDaylight, scalability bottlenecks |
Beyond 5G: Architectural Innovations for Terahertz Communication and Network Slicing | network slicing, MEC (Multi-access Edge Computing), beamforming, mmWave, URLLC (Ultra-Reliable Low-Latency Communication), O-RAN, energy efficiency |
IoT Network Protocols: A Comparative Study of LoRaWAN, NB-IoT, and Thread | LPWAN, LoRa, ZigBee 3.0, 6LoWPAN, TDMA scheduling, RPL routing, device density management |
Edge Caching in Content Delivery Networks: Algorithms and Economic Incentives | CDN, Akamai, cache replacement policies, DASH (Dynamic Adaptive Streaming), QoE optimization, edge server placement, bandwidth cost reduction |
A survey on flow batteries | battery electrolyte formulation |
Research on battery electrolyte formulation | flow batteries |
Please cite us if you find this project helpful for your project/paper:
```bibtex
@misc{liang2025surveyxacademicsurveyautomation,
      title={SurveyX: Academic Survey Automation via Large Language Models},
      author={Xun Liang and Jiawei Yang and Yezhaohui Wang and Chen Tang and Zifan Zheng and Shichao Song and Zehao Lin and Yebin Yang and Simin Niu and Hanyu Wang and Bo Tang and Feiyu Xiong and Keming Mao and Zhiyu Li},
      year={2025},
      eprint={2502.14776},
      archivePrefix={arXiv},
      primaryClass={cs.CL},
      url={https://arxiv.org/abs/2502.14776},
}
```
This open-source version of SurveyX is a simplified edition. It relies entirely on user-provided local reference documents and does not include advanced features such as:
- Keyword expansion and filtering algorithms
- Multimodal image parsing or figure extraction
- Online reference search or automatic data fetching
These advanced modules are only available in the full version of SurveyX, which is hosted by MemTensor (Shanghai) Technology Co., Ltd. If you would like to experience the complete features, please visit our official website: surveyx.cn
For questions or issues, please open an issue on the repository.
SurveyX uses advanced language models to assist with the generation of academic papers. Note, however, that the generated content is intended as a research aid: users should verify the accuracy of the generated papers, as SurveyX cannot guarantee full compliance with academic standards.