AdaDecode

Accelerating LLM Decoding with Adaptive Layer Parallelism
[arXiv] [Model] [Dataset] [X Summary]

AdaDecode is a fast and accurate LLM decoding method built on two core ideas: Adaptive Early Prediction and Parallel Token Processing.

  • 🧩 No draft model needed — just a lightweight LM head (0.2% model size)!
  • ✅ Predict tokens early using trained lightweight LM heads
  • 🚀 Start decoding the next token before finishing the current one
  • 🛡️ Final-layer verification ensures identical output to standard decoding
[Figure: AdaDecode overview]
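The early-prediction step can be pictured as a confidence check on a small LM head attached to an intermediate layer. The following is a minimal PyTorch sketch, not the repository's implementation; the sizes, the CONF_THRESHOLD value, and the maybe_predict_early helper are illustrative assumptions.

import torch
import torch.nn.functional as F

# Illustrative sizes only; a real model uses its own hidden size and vocabulary.
HIDDEN_SIZE, VOCAB_SIZE = 64, 1000
CONF_THRESHOLD = 0.9  # assumed confidence cutoff for emitting a token early

# A lightweight LM head attached to an intermediate layer: a single linear
# projection, tiny relative to the full transformer stack.
intermediate_lm_head = torch.nn.Linear(HIDDEN_SIZE, VOCAB_SIZE, bias=False)

def maybe_predict_early(hidden_state: torch.Tensor):
    """Return (token_id, confidence) if the intermediate head is confident
    enough to emit the token early, otherwise (None, confidence)."""
    logits = intermediate_lm_head(hidden_state)   # shape: [VOCAB_SIZE]
    probs = F.softmax(logits, dim=-1)
    confidence, token_id = probs.max(dim=-1)
    if confidence.item() >= CONF_THRESHOLD:
        return token_id.item(), confidence.item()
    return None, confidence.item()

# Toy usage: a random vector stands in for a real intermediate hidden state.
token, conf = maybe_predict_early(torch.randn(HIDDEN_SIZE))
print(f"early token: {token}, confidence: {conf:.3f}")

When the confidence check fails, decoding simply continues through the remaining layers as usual, so the threshold only controls how often early exits are taken.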

What makes AdaDecode different from existing solutions (e.g., speculative decoding)?

  • Speculative decoding relies on an auxiliary drafter model, leading to increased memory usage and requiring the same tokenizer and vocabulary as the main model
  • Layer skipping bypasses certain layers, which results in missing KV cache at those layers and can introduce discrepancies in future token predictions
  • AdaDecode accelerates decoding by adaptively predicting future tokens early based on confidence (e.g., $t_2$ and $t_3$ are predicted from different intermediate layers), enabling earlier progression to subsequent tokens
    • When future token steps require KV caches from the skipped layers (due to early predictions), these missing computations are executed in parallel with subsequent token processing (same-colored layers)
    • A final verification step is employed to ensure output consistency with standard autoregressive decoding
[Figure: Comparison with speculative decoding and layer skipping]
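The verification step resembles acceptance checking in speculative decoding: early-predicted tokens are kept only while they agree with what the final layer produces, which is what guarantees output identical to standard decoding. Below is a minimal illustrative sketch; the function name and toy token IDs are assumptions, not the repository's code.

from typing import List

def accept_early_tokens(early_tokens: List[int], final_tokens: List[int]) -> List[int]:
    """Keep the longest prefix of early-predicted tokens that matches the
    final layer's outputs; at the first mismatch, keep the final-layer token
    and discard the rest, so the result matches standard autoregressive decoding."""
    accepted = []
    for early, final in zip(early_tokens, final_tokens):
        if early != final:
            accepted.append(final)  # roll back to the verified token
            break
        accepted.append(early)
    return accepted

# Toy example: the third early prediction disagrees with the final layer.
print(accept_early_tokens([5, 17, 42, 9], [5, 17, 40, 9]))  # -> [5, 17, 40]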

Installation

Create a Python virtual environment and install all required packages.

conda create -n adadec python=3.10 -y
conda activate adadec
pip install -r requirements.txt

Evaluation

Use the following scripts to evaluate AdaDecode and compare it with standard autoregressive decoding.

Vanilla Autoregressive Decoding

bash run_vanilla.sh

AdaDecode

bash run_AdaDecode.sh

Model Checkpoints

Task       Model Size   Hugging Face Repo
XSum       8B           meng-lab/AdaDecode-Llama-3.1-8B-Instruct-XSum
XSum       13B          meng-lab/AdaDecode-CodeLlama-13B-Instruct-XSum
XSum       34B          meng-lab/AdaDecode-CodeLlama-34B-Instruct-XSum
HumanEval  8B           meng-lab/AdaDecode-Llama-3.1-8B-Instruct-HumanEval
HumanEval  13B          meng-lab/AdaDecode-CodeLlama-13B-Instruct-HumanEval
HumanEval  34B          meng-lab/AdaDecode-CodeLlama-34B-Instruct-HumanEval
GSM8K      8B           meng-lab/AdaDecode-Llama-3.1-8B-Instruct-GSM8K
GSM8K      13B          meng-lab/AdaDecode-CodeLlama-13B-Instruct-GSM8K
GSM8K      34B          meng-lab/AdaDecode-CodeLlama-34B-Instruct-GSM8K
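
If these repositories host full checkpoints in standard Hugging Face format (an assumption; they may instead ship only the trained lightweight LM heads to be combined with the corresponding base model), loading one with transformers would look roughly like this:

from transformers import AutoModelForCausalLM, AutoTokenizer

# Assumption: the repo contains a full causal-LM checkpoint in Hugging Face format.
repo_id = "meng-lab/AdaDecode-Llama-3.1-8B-Instruct-GSM8K"

tokenizer = AutoTokenizer.from_pretrained(repo_id)
model = AutoModelForCausalLM.from_pretrained(repo_id, torch_dtype="auto", device_map="auto")

prompt = "Question: If a train travels 60 miles per hour for 2 hours, how far does it go?\nAnswer:"
inputs = tokenizer(prompt, return_tensors="pt").to(model.device)
output_ids = model.generate(**inputs, max_new_tokens=64)
print(tokenizer.decode(output_ids[0], skip_special_tokens=True))

Note that a plain generate call like this only exercises standard decoding; the AdaDecode decoding path itself is driven by run_AdaDecode.sh above.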

Bugs or Questions?

If you have any questions related to the code or the paper, feel free to email Zhepei (zhepei.wei@virginia.edu). If you encounter any problems when using the code, or want to report a bug, feel free to open an issue! Please describe the problem in detail so we can help you better and more quickly!

Acknowledgements

This codebase is influenced by remarkable projects from the LLM community such as LayerSkip and Medusa.

Citation

Please cite our paper if you find the repo helpful in your work:

@inproceedings{wei2025adadecode,
  title={AdaDecode: Accelerating {LLM} Decoding with Adaptive Layer Parallelism},
  author={Zhepei Wei and Wei-Lin Chen and Xinyu Zhu and Yu Meng},
  booktitle={Forty-second International Conference on Machine Learning},
  year={2025},
  url={https://openreview.net/forum?id=VnO2GEpmlb}
}

