Awesome Resource-Efficient LLM Papers

A curated list of high-quality papers on resource-efficient LLMs.

This is the GitHub repo for our survey paper Beyond Efficiency: A Systematic Survey of Resource-Efficient Large Language Models.

Table of Contents

- LLM Architecture Design
  - Efficient Transformer Architecture
  - Non-transformer Architecture
- LLM Pre-Training
  - Memory Efficiency
  - Data Efficiency
- LLM Fine-Tuning
  - Parameter-Efficient Fine-Tuning
  - Full-Parameter Fine-Tuning
- LLM Inference
  - Model Compression
  - Dynamic Acceleration
- System Design
  - Deployment Optimization
  - Support Infrastructure
  - Other Systems
- Resource-Efficiency Evaluation Metrics & Benchmarks
- Reference

LLM Architecture Design

Efficient Transformer Architecture

| Date | Keywords | Paper | Venue |
|------|----------|-------|-------|
| 2019 | Approximate attention | Reformer: The Efficient Transformer | ICLR |
| 2020 | Approximate attention | Transformers are RNNs: Fast Autoregressive Transformers with Linear Attention | ICML |
| 2021 | Approximate attention | Efficient Attention: Attention with Linear Complexities | WACV |
| 2021 | Approximate attention | An Attention Free Transformer | arXiv |
| 2021 | Approximate attention | Self-attention Does Not Need O(n^2) Memory | arXiv |
| 2023 | Approximate attention | KDEformer: Accelerating Transformers via Kernel Density Estimation | ICML |
| 2023 | Approximate attention | Mega: Moving Average Equipped Gated Attention | ICLR |
| 2021 | Hardware optimization | LightSeq: A High Performance Inference Library for Transformers | NAACL |
| 2021 | Hardware optimization | FasterTransformer: A Faster Transformer Framework | GitHub |
| 2022 | Hardware optimization | xFormers - Toolbox to Accelerate Research on Transformers | GitHub |
| 2023 | Hardware optimization | FlashAttention: Fast and Memory-Efficient Exact Attention with IO-Awareness | NeurIPS |
| 2024 | Hardware optimization | FlashAttention-2: Faster Attention with Better Parallelism and Work Partitioning | ICLR |

Non-transformer Architecture

| Date | Keywords | Paper | Venue |
|------|----------|-------|-------|
| 2017 | Mixture of Experts | Outrageously Large Neural Networks: The Sparsely-Gated Mixture-of-Experts Layer | ICLR |
| 2022 | Mixture of Experts | Switch Transformers: Scaling to Trillion Parameter Models with Simple and Efficient Sparsity | JMLR |
| 2022 | Mixture of Experts | GLaM: Efficient Scaling of Language Models with Mixture-of-Experts | ICML |
| 2022 | Mixture of Experts | Mixture-of-Experts with Expert Choice Routing | NeurIPS |
| 2022 | Mixture of Experts | Efficient Large Scale Language Modeling with Mixtures of Experts | EMNLP |
| 2023 | RNN LM | RWKV: Reinventing RNNs for the Transformer Era | EMNLP-Findings |
| 2023 | MLP | Auto-Regressive Next-Token Predictors are Universal Learners | arXiv |
| 2023 | Convolutional LM | Hyena Hierarchy: Towards Larger Convolutional Language Models | ICML |
| 2023 | Sub-quadratic matrices | Monarch Mixer: A Simple Sub-Quadratic GEMM-Based Architecture | NeurIPS |
| 2023 | Selective State Space Model | Mamba: Linear-Time Sequence Modeling with Selective State Spaces | arXiv |

LLM Pre-Training

Memory Efficiency

Distributed Training

| Date | Keywords | Paper | Venue |
|------|----------|-------|-------|
| 2020 | Data Parallelism | ZeRO: Memory Optimizations Toward Training Trillion Parameter Models | IEEE SC20 |
| 2021 | Data Parallelism | FairScale: A general purpose modular PyTorch library for high performance and large scale training | GitHub |
| 2023 | Data Parallelism | PaLM: Scaling Language Modeling with Pathways | JMLR |
| 2018 | Model Parallelism | Mesh-TensorFlow: Deep Learning for Supercomputers | NeurIPS |
| 2019 | Model Parallelism | GPipe: Efficient Training of Giant Neural Networks using Pipeline Parallelism | NeurIPS |
| 2019 | Model Parallelism | Megatron-LM: Training Multi-Billion Parameter Language Models Using Model Parallelism | arXiv |
| 2019 | Model Parallelism | PipeDream: Generalized Pipeline Parallelism for DNN Training | SOSP |
| 2022 | Model Parallelism | Alpa: Automating Inter- and Intra-Operator Parallelism for Distributed Deep Learning | OSDI |
| 2023 | Model Parallelism | BPipe: Memory-Balanced Pipeline Parallelism for Training Large Language Models | JMLR |

Mixed Precision Training

| Date | Keywords | Paper | Venue |
|------|----------|-------|-------|
| 2017 | Mixed Precision Training | Mixed Precision Training | ICLR |
| 2018 | Mixed Precision Training | BERT: Pre-training of Deep Bidirectional Transformers for Language Understanding | NAACL |
| 2022 | Mixed Precision Training | BLOOM: A 176B-Parameter Open-Access Multilingual Language Model | arXiv |

Data Efficiency

Importance Sampling

| Date | Keywords | Paper | Venue |
|------|----------|-------|-------|
| 2023 | Survey on importance sampling | A Survey on Efficient Training of Transformers | IJCAI |
| 2018 | Importance sampling | Training Deep Models Faster with Robust, Approximate Importance Sampling | NeurIPS |
| 2018 | Importance sampling | Not All Samples Are Created Equal: Deep Learning with Importance Sampling | ICML |
| 2021 | Importance sampling | Deep Learning on a Data Diet: Finding Important Examples Early in Training | NeurIPS |
| 2022 | Importance sampling | Beyond Neural Scaling Laws: Beating Power Law Scaling via Data Pruning | NeurIPS |
| 2023 | Importance sampling | Data-Juicer: A One-Stop Data Processing System for Large Language Models | arXiv |
| 2023 | Importance sampling | INGENIOUS: Using Informative Data Subsets for Efficient Pre-Training of Language Models | EMNLP |
| 2023 | Importance sampling | Machine Learning Force Fields with Data Cost Aware Training | ICML |

Data Augmentation

| Date | Keywords | Paper | Venue |
|------|----------|-------|-------|
| 2023 | Data augmentation | MixGen: A New Multi-Modal Data Augmentation | WACV |
| 2023 | Data augmentation | Augmentation-Aware Self-Supervision for Data-Efficient GAN Training | NeurIPS |
| 2023 | Data augmentation | Improving End-to-End Speech Processing by Efficient Text Data Utilization with Latent Synthesis | EMNLP |
| 2023 | Data augmentation | FaMeSumm: Investigating and Improving Faithfulness of Medical Summarization | EMNLP |

Training Objective

| Date | Keywords | Paper | Venue |
|------|----------|-------|-------|
| 2023 | Training objective | Challenges and Applications of Large Language Models | arXiv |
| 2023 | Training objective | Efficient Data Learning for Open Information Extraction with Pre-trained Language Models | EMNLP |
| 2019 | Masked language modeling | MASS: Masked Sequence to Sequence Pre-training for Language Generation | ICML |
| 2022 | Masked image modeling | Masked Autoencoders Are Scalable Vision Learners | CVPR |
| 2023 | Masked language-image modeling | Scaling Language-Image Pre-training via Masking | CVPR |

LLM Fine-Tuning

Parameter-Efficient Fine-Tuning

| Date | Keywords | Paper | Venue |
|------|----------|-------|-------|
| 2019 | Masking-based fine-tuning | SMART: Robust and Efficient Fine-Tuning for Pre-trained Natural Language Models through Principled Regularized Optimization | ACL |
| 2021 | Masking-based fine-tuning | BitFit: Simple Parameter-efficient Fine-tuning for Transformer-based Masked Language-models | ACL |
| 2021 | Masking-based fine-tuning | Raise a Child in Large Language Model: Towards Effective and Generalizable Fine-tuning | EMNLP |
| 2021 | Masking-based fine-tuning | Unlearning Bias in Language Models by Partitioning Gradients | ACL |
| 2022 | Masking-based fine-tuning | Fine-Tuning Pre-Trained Language Models Effectively by Optimizing Subnetworks Adaptively | NeurIPS |

Full-Parameter Fine-Tuning

| Date | Keywords | Paper | Venue |
|------|----------|-------|-------|
| 2023 | Comparative study between full-parameter and LoRA-based fine-tuning | A Comparative Study between Full-Parameter and LoRA-based Fine-Tuning on Chinese Instruction Data for Instruction Following Large Language Model | arXiv |
| 2023 | Comparative study between full-parameter and parameter-efficient fine-tuning | Comparison between Parameter-Efficient Techniques and Full Fine-Tuning: A Case Study on Multilingual News Article Classification | arXiv |
| 2023 | Full-parameter fine-tuning with limited resources | Full Parameter Fine-tuning for Large Language Models with Limited Resources | arXiv |
| 2023 | Memory-efficient fine-tuning | Fine-Tuning Language Models with Just Forward Passes | NeurIPS |
| 2023 | Full-parameter fine-tuning for medicine applications | PMC-LLaMA: Towards Building Open-source Language Models for Medicine | arXiv |
| 2022 | Drawback of full-parameter fine-tuning | Fine-Tuning can Distort Pretrained Features and Underperform Out-of-Distribution | ICLR |

LLM Inference

Model Compression

Pruning

| Date | Keywords | Paper | Venue |
|------|----------|-------|-------|
| 2023 | Unstructured Pruning | SparseGPT: Massive Language Models Can be Accurately Pruned in One-Shot | ICML |
| 2023 | Unstructured Pruning | A Simple and Effective Pruning Approach for Large Language Models | ICLR |
| 2023 | Unstructured Pruning | AccelTran: A Sparsity-Aware Accelerator for Dynamic Inference With Transformers | TCAD |
| 2023 | Structured Pruning | LLM-Pruner: On the Structural Pruning of Large Language Models | NeurIPS |
| 2023 | Structured Pruning | LoSparse: Structured Compression of Large Language Models based on Low-Rank and Sparse Approximation | ICML |
| 2023 | Structured Pruning | Structured Pruning for Efficient Generative Pre-trained Language Models | ACL |
| 2023 | Structured Pruning | ZipLM: Inference-Aware Structured Pruning of Language Models | NeurIPS |
| 2023 | Contextual Pruning | Deja Vu: Contextual Sparsity for Efficient LLMs at Inference Time | ICML |

Quantization

| Date | Keywords | Paper | Venue |
|------|----------|-------|-------|
| 2023 | Weight Quantization | FlexRound: Learnable Rounding based on Element-wise Division for Post-Training Quantization | ICML |
| 2023 | Weight Quantization | Outlier Suppression+: Accurate Quantization of Large Language Models by Equivalent and Optimal Shifting and Scaling | EMNLP |
| 2023 | Weight Quantization | OWQ: Outlier-Aware Weight Quantization for Efficient Fine-Tuning and Inference of Large Language Models | AAAI |
| 2023 | Weight Quantization | GPTQ: Accurate Post-Training Quantization for Generative Pre-trained Transformers | ICLR |
| 2023 | Weight Quantization | Dynamic Stashing Quantization for Efficient Transformer Training | EMNLP |
| 2023 | Weight Quantization | Quantization-Aware and Tensor-Compressed Training of Transformers for Natural Language Understanding | Interspeech |
| 2023 | Weight Quantization | QLoRA: Efficient Finetuning of Quantized LLMs | NeurIPS |
| 2023 | Weight Quantization | Stable and Low-Precision Training for Large-Scale Vision-Language Models | NeurIPS |
| 2023 | Weight Quantization | PreQuant: A Task-Agnostic Quantization Approach for Pre-trained Language Models | ACL |
| 2023 | Weight Quantization | OliVe: Accelerating Large Language Models via Hardware-Friendly Outlier-Victim Pair Quantization | ISCA |
| 2023 | Weight Quantization | AWQ: Activation-aware Weight Quantization for LLM Compression and Acceleration | arXiv |
| 2023 | Weight Quantization | SpQR: A Sparse-Quantized Representation for Near-Lossless LLM Weight Compression | arXiv |
| 2023 | Weight Quantization | SqueezeLLM: Dense-and-Sparse Quantization | arXiv |
| 2023 | Weight Quantization | LLM-QAT: Data-Free Quantization Aware Training for Large Language Models | arXiv |
| 2022 | Activation Quantization | GACT: Activation Compressed Training for Generic Network Architectures | ICML |
| 2021 | Activation Quantization | AC-GC: Lossy Activation Compression with Guaranteed Convergence | NeurIPS |
| 2022 | Fixed-point Quantization | Boost Vision Transformer with GPU-Friendly Sparsity and Quantization | ACL |

Dynamic Acceleration

Input Pruning

| Date | Keywords | Paper | Venue |
|------|----------|-------|-------|
| 2021 | Score-based Token Removal | Efficient Sparse Attention Architecture with Cascade Token and Head Pruning | HPCA |
| 2022 | Score-based Token Removal | Learned Token Pruning for Transformers | KDD |
| 2023 | Score-based Token Removal | Constraint-aware and Ranking-distilled Token Pruning for Efficient Transformer Inference | KDD |
| 2021 | Learning-based Token Removal | TR-BERT: Dynamic Token Reduction for Accelerating BERT Inference | NAACL |
| 2022 | Learning-based Token Removal | Transkimmer: Transformer Learns to Layer-wise Skim | ACL |
| 2023 | Learning-based Token Removal | PuMer: Pruning and Merging Tokens for Efficient Vision Language Models | ACL |
| 2023 | Learning-based Token Removal | Infor-Coef: Information Bottleneck-based Dynamic Token Downsampling for Compact and Efficient Language Model | arXiv |
| 2023 | Learning-based Token Removal | SmartTrim: Adaptive Tokens and Parameters Pruning for Efficient Vision-Language Models | arXiv |

System Design

Deployment Optimization

| Date | Keywords | Paper | Venue |
|------|----------|-------|-------|
| 2022 | Hardware offloading | DeepSpeed Inference: Enabling Efficient Inference of Transformer Models at Unprecedented Scale | IEEE SC22 |
| 2023 | Hardware offloading | FlexGen: High-Throughput Generative Inference of Large Language Models with a Single GPU | PMLR |
| 2023 | Hardware offloading | Fast Distributed Inference Serving for Large Language Models | arXiv |
| 2022 | Collaborative inference | Petals: Collaborative Inference and Fine-tuning of Large Models | arXiv |

Support Infrastructure

| Date | Keywords | Paper | Venue |
|------|----------|-------|-------|
| 2018 | Libraries | Mesh-TensorFlow: Deep Learning for Supercomputers | NeurIPS |
| 2019 | Libraries | Megatron-LM: Training Multi-Billion Parameter Language Models Using Model Parallelism | IEEE SC22 |
| 2022 | Libraries | DeepSpeed Inference: Enabling Efficient Inference of Transformer Models at Unprecedented Scale | IEEE SC22 |
| 2022 | Libraries | Alpa: Automating Inter- and Intra-Operator Parallelism for Distributed Deep Learning | OSDI |
| 2023 | Libraries | Colossal-AI: A Unified Deep Learning System For Large-Scale Parallel Training | ICPP |
| 2023 | Libraries | GPT-NeoX-20B: An Open-Source Autoregressive Language Model | ACL |
| 2020 | Edge devices | Lite Transformer with Long-Short Range Attention | arXiv |
| 2021 | Edge devices | Generate More Features with Cheap Operations for BERT | ACL |
| 2021 | Edge devices | SqueezeBERT: What can computer vision teach NLP about efficient neural networks? | SustaiNLP |
| 2022 | Edge devices | EdgeFormer: A Parameter-Efficient Transformer for On-Device Seq2seq Generation | arXiv |
| 2022 | Edge devices | ProFormer: Towards On-Device LSH Projection-Based Transformers | ACL |
| 2023 | Edge devices | Training Large-Vocabulary Neural Language Models by Private Federated Learning for Resource-Constrained Devices | ICASSP |
| 2023 | Edge devices | Federated Fine-Tuning of LLMs on the Very Edge: The Good, the Bad, the Ugly | arXiv |
| 2023 | Edge devices | Large Language Models Empowered Autonomous Edge AI for Connected Intelligence | arXiv |

Other Systems

| Date | Keywords | Paper | Venue |
|------|----------|-------|-------|
| 2023 | Other Systems | Tabi: An Efficient Multi-Level Inference System for Large Language Models | EuroSys |
| 2023 | Other Systems | Near-Duplicate Sequence Search at Scale for Large Language Model Memorization Evaluation | PACMMOD |

Resource-Efficiency Evaluation Metrics & Benchmarks

🧮 Computation Metrics

| Metric | Description | Example Usage |
|--------|-------------|---------------|
| FLOPs (Floating-point operations) | the number of arithmetic operations on floating-point numbers | [FLOPs] |
| Training Time | the total duration required for training, typically measured in wall-clock minutes, hours, or days | [minutes, days], [hours] |
| Inference Time/Latency | the average time required to generate an output after receiving an input, typically measured in wall-clock time or CPU/GPU/TPU clock time in milliseconds or seconds | [end-to-end latency in seconds], [next token generation latency in milliseconds] |
| Throughput | the rate of output token generation or task completion, typically measured in tokens per second (TPS) or queries per second (QPS) | [tokens/s], [queries/s] |
| Speed-Up Ratio | the improvement in inference speed compared to a baseline model | [inference time speed-up], [throughput speed-up] |
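
For concreteness, the sketch below shows one straightforward way to estimate end-to-end latency and token throughput from wall-clock timing; the `generate` callable and the dummy generator are hypothetical stand-ins for whatever inference API is actually being profiled.

```python
import time

def measure_latency_and_throughput(generate, prompts, runs=10):
    """Average end-to-end latency (s) and throughput (tokens/s) of a generate() callable.

    `generate(prompt)` is a hypothetical stand-in that returns the generated tokens;
    swap in the inference API under evaluation.
    """
    total_time, total_tokens = 0.0, 0
    for _ in range(runs):
        for prompt in prompts:
            start = time.perf_counter()          # wall-clock timer
            output_tokens = generate(prompt)     # end-to-end generation call
            total_time += time.perf_counter() - start
            total_tokens += len(output_tokens)
    latency = total_time / (runs * len(prompts))  # seconds per request
    throughput = total_tokens / total_time        # output tokens per second
    return latency, throughput

# Dummy generator pretending to emit 32 tokens, just to make the sketch runnable.
lat, tps = measure_latency_and_throughput(lambda p: list(range(32)), ["hello world"])
print(f"latency: {lat * 1000:.2f} ms, throughput: {tps:.1f} tokens/s")
```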

💾 Memory Metrics

| Metric | Description | Example Usage |
|--------|-------------|---------------|
| Number of Parameters | the number of adjustable variables in the LLM’s neural network | [number of parameters] |
| Model Size | the storage space required for storing the entire model | [peak memory usage in GB] |
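
As an illustration, parameter count and parameter storage size can be read off a PyTorch module as sketched below; peak memory at runtime additionally depends on activations, KV caches, and optimizer state, which this sketch does not capture, and the toy `nn.Sequential` model is only a placeholder for a real LLM.

```python
import torch.nn as nn

def parameter_count(model: nn.Module) -> int:
    """Number of adjustable parameters in the network."""
    return sum(p.numel() for p in model.parameters())

def parameter_size_gb(model: nn.Module) -> float:
    """Storage needed for parameters and buffers, in GB (activations excluded)."""
    param_bytes = sum(p.numel() * p.element_size() for p in model.parameters())
    buffer_bytes = sum(b.numel() * b.element_size() for b in model.buffers())
    return (param_bytes + buffer_bytes) / 1e9

# Toy stand-in for a real LLM.
model = nn.Sequential(nn.Linear(1024, 4096), nn.GELU(), nn.Linear(4096, 1024))
print(f"{parameter_count(model):,} parameters, {parameter_size_gb(model):.3f} GB")
```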

⚡️ Energy Metrics

| Metric | Description | Example Usage |
|--------|-------------|---------------|
| Energy Consumption | the electrical energy consumed during the LLM’s lifecycle | [kWh] |
| Carbon Emission | the greenhouse gas emissions associated with the model’s energy usage | [kgCO2eq] |

Several software packages are available for real-time tracking of energy consumption and carbon emissions, and related tools can help predict the expected energy usage and carbon footprint before actual training or inference.
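
As one example, the sketch below uses CodeCarbon's `EmissionsTracker` to wrap a workload; this assumes the `codecarbon` package is installed, `run_training_step` is a hypothetical placeholder, and other trackers expose broadly similar start/stop interfaces.

```python
# pip install codecarbon  (assumed available; illustrative sketch, not tooling
# provided by the survey or this repository)
import time
from codecarbon import EmissionsTracker

def run_training_step():
    """Hypothetical placeholder for the real training or inference workload."""
    time.sleep(1.0)

tracker = EmissionsTracker(project_name="llm-demo")  # estimates energy use and emissions
tracker.start()
try:
    run_training_step()
finally:
    emissions_kg = tracker.stop()  # estimated emissions over the tracked span, in kgCO2eq
print(f"Estimated emissions: {emissions_kg:.6f} kgCO2eq")
```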

💵 Financial Cost Metric

| Metric | Description | Example Usage |
|--------|-------------|---------------|
| Dollars per parameter | the total cost of training (or running) the LLM divided by its number of parameters | |
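
A toy illustration of the arithmetic, with made-up numbers that are not taken from the survey:

```python
# Made-up figures purely to illustrate the definition of the metric.
total_cost_usd = 10_000_000      # assumed total training bill in dollars
num_parameters = 10e9            # assumed model size: 10B parameters

dollars_per_parameter = total_cost_usd / num_parameters
print(f"{dollars_per_parameter:.1e} $/parameter")   # 1.0e-03 $/parameter
```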

📨 Network Communication Metric

| Metric | Description | Example Usage |
|--------|-------------|---------------|
| Communication Volume | the total amount of data transmitted across the network during a specific LLM execution or training run | [communication volume in TB] |
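
As a rough back-of-the-envelope sketch (not a tool from the papers above), the per-step gradient all-reduce volume in data-parallel training can be estimated from the parameter count; the 2(N-1)/N factor is the standard ring all-reduce cost, while the byte width and worker count below are assumptions.

```python
def ring_allreduce_gb_per_step(num_params: float, bytes_per_grad: int = 2,
                               num_workers: int = 8) -> float:
    """Estimated data each worker sends per step to all-reduce its gradients.

    Assumes 2-byte (fp16/bf16) gradients and the standard ring all-reduce cost of
    2 * (N - 1) / N * message_size per participant; real traffic depends on the
    framework, topology, and any gradient compression in use.
    """
    message_bytes = num_params * bytes_per_grad
    return 2 * (num_workers - 1) / num_workers * message_bytes / 1e9

# Example: a 7B-parameter model trained with 8 data-parallel workers.
print(f"{ring_allreduce_gb_per_step(7e9):.1f} GB sent per worker per step")
```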

💡 Other Metrics

| Metric | Description | Example Usage |
|--------|-------------|---------------|
| Compression Ratio | the reduction in size of the compressed model compared to the original model | [compress rate], [percentage of weights remaining] |
| Loyalty/Fidelity | the resemblance between the teacher and student models in terms of prediction consistency and alignment of predicted probability distributions | [loyalty], [fidelity] |
| Robustness | the resistance to adversarial attacks, where slight input modifications can potentially manipulate the model's output | [after-attack accuracy, query number] |
| Pareto Optimality | the optimal trade-offs between various competing factors | [Pareto frontier (cost and accuracy)], [Pareto frontier (performance and FLOPs)] |
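
To make the compression-ratio and loyalty/fidelity rows concrete, the sketch below shows one generic way these quantities can be computed from teacher and student outputs; it is illustrative only and not tied to any specific paper in the list.

```python
import numpy as np

def compression_ratio(original_bytes: int, compressed_bytes: int) -> float:
    """Size reduction of the compressed model relative to the original."""
    return original_bytes / compressed_bytes

def _softmax(logits: np.ndarray) -> np.ndarray:
    z = logits - logits.max(-1, keepdims=True)
    e = np.exp(z)
    return e / e.sum(-1, keepdims=True)

def loyalty(teacher_logits: np.ndarray, student_logits: np.ndarray) -> float:
    """Fraction of inputs where teacher and student predict the same label."""
    return float(np.mean(teacher_logits.argmax(-1) == student_logits.argmax(-1)))

def fidelity_kl(teacher_logits: np.ndarray, student_logits: np.ndarray) -> float:
    """Mean KL(teacher || student) between predicted distributions (lower = closer)."""
    p, q = _softmax(teacher_logits), _softmax(student_logits)
    return float(np.mean((p * (np.log(p + 1e-12) - np.log(q + 1e-12))).sum(-1)))

# Toy example on random logits for a 3-class task.
rng = np.random.default_rng(0)
t, s = rng.normal(size=(100, 3)), rng.normal(size=(100, 3))
print(compression_ratio(10_000_000, 2_500_000), loyalty(t, s), fidelity_kl(t, s))
```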

Benchmarks

| Benchmark | Description | Paper |
|-----------|-------------|-------|
| General NLP Benchmarks | an extensive collection of general NLP benchmarks such as GLUE, SuperGLUE, WMT, and SQuAD | A Comprehensive Overview of Large Language Models |
| Dynaboard | an open-source platform for evaluating NLP models in the cloud, offering real-time interaction and a holistic assessment of model quality with a customizable Dynascore | Dynaboard: An Evaluation-As-A-Service Platform for Holistic Next-Generation Benchmarking |
| EfficientQA | an open-domain Question Answering (QA) challenge at NeurIPS 2020 that focuses on building accurate, memory-efficient QA systems | NeurIPS 2020 EfficientQA Competition: Systems, Analyses and Lessons Learned |
| SustaiNLP 2020 Shared Task | a challenge for developing energy-efficient NLP models, assessing their performance across eight NLU tasks using SuperGLUE metrics and evaluating their energy consumption during inference | Overview of the SustaiNLP 2020 Shared Task |
| ELUE (Efficient Language Understanding Evaluation) | a benchmark platform for evaluating NLP model efficiency across various tasks, offering online metrics and requiring only a Python model definition file for submission | Towards Efficient NLP: A Standard Evaluation and A Strong Baseline |
| VLUE (Vision-Language Understanding Evaluation) | a comprehensive benchmark for assessing vision-language models across multiple tasks, offering an online platform for evaluation and comparison | VLUE: A Multi-Task Benchmark for Evaluating Vision-Language Models |
| Long Range Arena (LRA) | a benchmark suite evaluating efficient Transformer models on long-context tasks, spanning diverse modalities and reasoning types while allowing evaluations under controlled resource constraints, highlighting real-world efficiency | Long Range Arena: A Benchmark for Efficient Transformers |
| Efficiency-aware MS MARCO | an enhanced MS MARCO information retrieval benchmark that integrates efficiency metrics like per-query latency and cost alongside accuracy, facilitating a comprehensive evaluation of IR systems | Moving Beyond Downstream Task Accuracy for Information Retrieval Benchmarking |

Reference

If you find this paper list useful in your research, please consider citing:

@article{bai2024beyond,
  title={Beyond Efficiency: A Systematic Survey of Resource-Efficient Large Language Models},
  author={Bai, Guangji and Chai, Zheng and Ling, Chen and Wang, Shiyu and Lu, Jiaying and Zhang, Nan and Shi, Tingwei and Yu, Ziyang and Zhu, Mengdan and Zhang, Yifei and others},
  journal={arXiv preprint arXiv:2401.00625},
  year={2024}
}
