Awesome Resource-Efficient LLM Papers

A curated list of high-quality papers on resource-efficient LLMs.

This is the GitHub repo for our survey paper Beyond Efficiency: A Systematic Survey of Resource-Efficient Large Language Models.

Table of Contents

- LLM Architecture Design
  - Efficient Transformer Architecture
  - Non-transformer Architecture
- LLM Pre-Training
  - Memory Efficiency
  - Data Efficiency
- LLM Fine-Tuning
  - Parameter-Efficient Fine-Tuning
  - Full-Parameter Fine-Tuning
- LLM Inference
  - Model Compression
  - Dynamic Acceleration
- System Design
  - Deployment Optimization
  - Support Infrastructure
  - Other Systems
- Resource-Efficiency Evaluation Metrics & Benchmarks
- Reference

LLM Architecture Design

Efficient Transformer Architecture

| Date | Keywords | Paper | Venue |
|------|----------|-------|-------|
| 2019 | Approximate attention | Reformer: The Efficient Transformer | ICLR |
| 2020 | Approximate attention | Transformers are RNNs: Fast Autoregressive Transformers with Linear Attention | ICML |
| 2021 | Approximate attention | Efficient Attention: Attention with Linear Complexities | WACV |
| 2021 | Approximate attention | An Attention Free Transformer | arXiv |
| 2021 | Approximate attention | Self-attention Does Not Need O(n^2) Memory | arXiv |
| 2023 | Approximate attention | KDEformer: Accelerating Transformers via Kernel Density Estimation | ICML |
| 2023 | Approximate attention | Mega: Moving Average Equipped Gated Attention | ICLR |
| 2021 | Hardware optimization | LightSeq: A High Performance Inference Library for Transformers | NAACL |
| 2021 | Hardware optimization | FasterTransformer: A Faster Transformer Framework | GitHub |
| 2022 | Hardware optimization | xFormers - Toolbox to Accelerate Research on Transformers | GitHub |
| 2023 | Hardware optimization | FlashAttention: Fast and Memory-Efficient Exact Attention with IO-Awareness | NeurIPS |
| 2024 | Hardware optimization | FlashAttention-2: Faster Attention with Better Parallelism and Work Partitioning | ICLR |

Non-transformer Architecture

| Date | Keywords | Paper | Venue |
|------|----------|-------|-------|
| 2017 | Mixture of Experts | Outrageously Large Neural Networks: The Sparsely-Gated Mixture-of-Experts Layer | ICLR |
| 2022 | Mixture of Experts | Switch Transformers: Scaling to Trillion Parameter Models with Simple and Efficient Sparsity | JMLR |
| 2022 | Mixture of Experts | GLaM: Efficient Scaling of Language Models with Mixture-of-Experts | ICML |
| 2022 | Mixture of Experts | Mixture-of-Experts with Expert Choice Routing | NeurIPS |
| 2022 | Mixture of Experts | Efficient Large Scale Language Modeling with Mixtures of Experts | EMNLP |
| 2023 | RNN LM | RWKV: Reinventing RNNs for the Transformer Era | EMNLP-Findings |
| 2023 | MLP | Auto-Regressive Next-Token Predictors are Universal Learners | arXiv |
| 2023 | Convolutional LM | Hyena Hierarchy: Towards Larger Convolutional Language Models | ICML |
| 2023 | Sub-quadratic matrices | Monarch Mixer: A Simple Sub-Quadratic GEMM-Based Architecture | NeurIPS |
| 2023 | Selective State Space Model | Mamba: Linear-Time Sequence Modeling with Selective State Spaces | arXiv |

LLM Pre-Training

Memory Efficiency

Distributed Training

| Date | Keywords | Paper | Venue |
|------|----------|-------|-------|
| 2020 | Data Parallelism | ZeRO: Memory Optimizations Toward Training Trillion Parameter Models | IEEE SC20 |
| 2021 | Data Parallelism | FairScale: A general purpose modular PyTorch library for high performance and large scale training | GitHub |
| 2023 | Data Parallelism | PaLM: Scaling Language Modeling with Pathways | JMLR |
| 2018 | Model Parallelism | Mesh-TensorFlow: Deep Learning for Supercomputers | NeurIPS |
| 2019 | Model Parallelism | GPipe: Efficient Training of Giant Neural Networks using Pipeline Parallelism | NeurIPS |
| 2019 | Model Parallelism | Megatron-LM: Training Multi-Billion Parameter Language Models Using Model Parallelism | arXiv |
| 2019 | Model Parallelism | PipeDream: Generalized Pipeline Parallelism for DNN Training | SOSP |
| 2022 | Model Parallelism | Alpa: Automating Inter- and Intra-Operator Parallelism for Distributed Deep Learning | OSDI |
| 2023 | Model Parallelism | BPipe: Memory-Balanced Pipeline Parallelism for Training Large Language Models | JMLR |

Mixed Precision Training

| Date | Keywords | Paper | Venue |
|------|----------|-------|-------|
| 2017 | Mixed Precision Training | Mixed Precision Training | ICLR |
| 2018 | Mixed Precision Training | BERT: Pre-training of Deep Bidirectional Transformers for Language Understanding | NAACL |
| 2022 | Mixed Precision Training | BLOOM: A 176B-Parameter Open-Access Multilingual Language Model | arXiv |

Data Efficiency

Importance Sampling

| Date | Keywords | Paper | Venue |
|------|----------|-------|-------|
| 2023 | Survey on importance sampling | A Survey on Efficient Training of Transformers | IJCAI |
| 2018 | Importance sampling | Training Deep Models Faster with Robust, Approximate Importance Sampling | NeurIPS |
| 2018 | Importance sampling | Not All Samples Are Created Equal: Deep Learning with Importance Sampling | ICML |
| 2021 | Importance sampling | Deep Learning on a Data Diet: Finding Important Examples Early in Training | NeurIPS |
| 2022 | Importance sampling | Beyond Neural Scaling Laws: Beating Power Law Scaling via Data Pruning | NeurIPS |
| 2023 | Importance sampling | Data-Juicer: A One-Stop Data Processing System for Large Language Models | arXiv |
| 2023 | Importance sampling | INGENIOUS: Using Informative Data Subsets for Efficient Pre-Training of Language Models | EMNLP |
| 2023 | Importance sampling | Machine Learning Force Fields with Data Cost Aware Training | ICML |

Data Augmentation

| Date | Keywords | Paper | Venue |
|------|----------|-------|-------|
| 2023 | Data augmentation | MixGen: A New Multi-Modal Data Augmentation | WACV |
| 2023 | Data augmentation | Augmentation-Aware Self-Supervision for Data-Efficient GAN Training | NeurIPS |
| 2023 | Data augmentation | Improving End-to-End Speech Processing by Efficient Text Data Utilization with Latent Synthesis | EMNLP |
| 2023 | Data augmentation | FaMeSumm: Investigating and Improving Faithfulness of Medical Summarization | EMNLP |

Training Objective

| Date | Keywords | Paper | Venue |
|------|----------|-------|-------|
| 2023 | Training objective | Challenges and Applications of Large Language Models | arXiv |
| 2023 | Training objective | Efficient Data Learning for Open Information Extraction with Pre-trained Language Models | EMNLP |
| 2019 | Masked language modeling | MASS: Masked Sequence to Sequence Pre-training for Language Generation | ICML |
| 2022 | Masked image modeling | Masked Autoencoders Are Scalable Vision Learners | CVPR |
| 2023 | Masked language-image modeling | Scaling Language-Image Pre-training via Masking | CVPR |

LLM Fine-Tuning

Parameter-Efficient Fine-Tuning

| Date | Keywords | Paper | Venue |
|------|----------|-------|-------|
| 2019 | Masking-based fine-tuning | SMART: Robust and Efficient Fine-Tuning for Pre-trained Natural Language Models through Principled Regularized Optimization | ACL |
| 2021 | Masking-based fine-tuning | BitFit: Simple Parameter-efficient Fine-tuning for Transformer-based Masked Language-models | ACL |
| 2021 | Masking-based fine-tuning | Raise a Child in Large Language Model: Towards Effective and Generalizable Fine-tuning | EMNLP |
| 2021 | Masking-based fine-tuning | Unlearning Bias in Language Models by Partitioning Gradients | ACL |
| 2022 | Masking-based fine-tuning | Fine-Tuning Pre-Trained Language Models Effectively by Optimizing Subnetworks Adaptively | NeurIPS |

Full-Parameter Fine-Tuning

| Date | Keywords | Paper | Venue |
|------|----------|-------|-------|
| 2023 | Comparative study between full-parameter and LoRA-based fine-tuning | A Comparative Study between Full-Parameter and LoRA-based Fine-Tuning on Chinese Instruction Data for Instruction Following Large Language Model | arXiv |
| 2023 | Comparative study between full-parameter and parameter-efficient fine-tuning | Comparison between Parameter-Efficient Techniques and Full Fine-Tuning: A Case Study on Multilingual News Article Classification | arXiv |
| 2023 | Full-parameter fine-tuning with limited resources | Full Parameter Fine-tuning for Large Language Models with Limited Resources | arXiv |
| 2023 | Memory-efficient fine-tuning | Fine-Tuning Language Models with Just Forward Passes | NeurIPS |
| 2023 | Full-parameter fine-tuning for medicine applications | PMC-LLaMA: Towards Building Open-source Language Models for Medicine | arXiv |
| 2022 | Drawback of full-parameter fine-tuning | Fine-Tuning can Distort Pretrained Features and Underperform Out-of-Distribution | ICLR |

LLM Inference

Model Compression

Pruning

| Date | Keywords | Paper | Venue |
|------|----------|-------|-------|
| 2023 | Unstructured Pruning | SparseGPT: Massive Language Models Can be Accurately Pruned in One-Shot | ICML |
| 2023 | Unstructured Pruning | A Simple and Effective Pruning Approach for Large Language Models | ICLR |
| 2023 | Unstructured Pruning | AccelTran: A Sparsity-Aware Accelerator for Dynamic Inference With Transformers | TCAD |
| 2023 | Structured Pruning | LLM-Pruner: On the Structural Pruning of Large Language Models | NeurIPS |
| 2023 | Structured Pruning | LoSparse: Structured Compression of Large Language Models based on Low-Rank and Sparse Approximation | ICML |
| 2023 | Structured Pruning | Structured Pruning for Efficient Generative Pre-trained Language Models | ACL |
| 2023 | Structured Pruning | ZipLM: Inference-Aware Structured Pruning of Language Models | NeurIPS |
| 2023 | Contextual Pruning | Deja Vu: Contextual Sparsity for Efficient LLMs at Inference Time | ICML |

Quantization

| Date | Keywords | Paper | Venue |
|------|----------|-------|-------|
| 2023 | Weight Quantization | FlexRound: Learnable Rounding based on Element-wise Division for Post-Training Quantization | ICML |
| 2023 | Weight Quantization | Outlier Suppression+: Accurate Quantization of Large Language Models by Equivalent and Optimal Shifting and Scaling | EMNLP |
| 2023 | Weight Quantization | OWQ: Outlier-Aware Weight Quantization for Efficient Fine-Tuning and Inference of Large Language Models | AAAI |
| 2023 | Weight Quantization | GPTQ: Accurate Post-Training Quantization for Generative Pre-trained Transformers | ICLR |
| 2023 | Weight Quantization | Dynamic Stashing Quantization for Efficient Transformer Training | EMNLP |
| 2023 | Weight Quantization | Quantization-Aware and Tensor-Compressed Training of Transformers for Natural Language Understanding | Interspeech |
| 2023 | Weight Quantization | QLoRA: Efficient Finetuning of Quantized LLMs | NeurIPS |
| 2023 | Weight Quantization | Stable and Low-Precision Training for Large-Scale Vision-Language Models | NeurIPS |
| 2023 | Weight Quantization | PreQuant: A Task-Agnostic Quantization Approach for Pre-trained Language Models | ACL |
| 2023 | Weight Quantization | OliVe: Accelerating Large Language Models via Hardware-Friendly Outlier-Victim Pair Quantization | ISCA |
| 2023 | Weight Quantization | AWQ: Activation-aware Weight Quantization for LLM Compression and Acceleration | arXiv |
| 2023 | Weight Quantization | SpQR: A Sparse-Quantized Representation for Near-Lossless LLM Weight Compression | arXiv |
| 2023 | Weight Quantization | SqueezeLLM: Dense-and-Sparse Quantization | arXiv |
| 2023 | Weight Quantization | LLM-QAT: Data-Free Quantization Aware Training for Large Language Models | arXiv |
| 2022 | Activation Quantization | GACT: Activation Compressed Training for Generic Network Architectures | ICML |
| 2021 | Activation Quantization | AC-GC: Lossy Activation Compression with Guaranteed Convergence | NeurIPS |
| 2022 | Fixed-point Quantization | Boost Vision Transformer with GPU-Friendly Sparsity and Quantization | ACL |

Dynamic Acceleration

Input Pruning

| Date | Keywords | Paper | Venue |
|------|----------|-------|-------|
| 2021 | Score-based Token Removal | Efficient Sparse Attention Architecture with Cascade Token and Head Pruning | HPCA |
| 2022 | Score-based Token Removal | Learned Token Pruning for Transformers | KDD |
| 2023 | Score-based Token Removal | Constraint-aware and Ranking-distilled Token Pruning for Efficient Transformer Inference | KDD |
| 2021 | Learning-based Token Removal | TR-BERT: Dynamic Token Reduction for Accelerating BERT Inference | NAACL |
| 2022 | Learning-based Token Removal | Transkimmer: Transformer Learns to Layer-wise Skim | ACL |
| 2023 | Learning-based Token Removal | PuMer: Pruning and Merging Tokens for Efficient Vision Language Models | ACL |
| 2023 | Learning-based Token Removal | Infor-Coef: Information Bottleneck-based Dynamic Token Downsampling for Compact and Efficient Language Model | arXiv |
| 2023 | Learning-based Token Removal | SmartTrim: Adaptive Tokens and Parameters Pruning for Efficient Vision-Language Models | arXiv |

System Design

Deployment Optimization

| Date | Keywords | Paper | Venue |
|------|----------|-------|-------|
| 2022 | Hardware offloading | DeepSpeed Inference: Enabling Efficient Inference of Transformer Models at Unprecedented Scale | IEEE SC22 |
| 2023 | Hardware offloading | FlexGen: High-Throughput Generative Inference of Large Language Models with a Single GPU | PMLR |
| 2023 | Hardware offloading | Fast Distributed Inference Serving for Large Language Models | arXiv |
| 2022 | Collaborative inference | Petals: Collaborative Inference and Fine-tuning of Large Models | arXiv |

Support Infrastructure

| Date | Keywords | Paper | Venue |
|------|----------|-------|-------|
| 2018 | Libraries | Mesh-TensorFlow: Deep Learning for Supercomputers | NeurIPS |
| 2019 | Libraries | Megatron-LM: Training Multi-Billion Parameter Language Models Using Model Parallelism | IEEE SC22 |
| 2022 | Libraries | DeepSpeed Inference: Enabling Efficient Inference of Transformer Models at Unprecedented Scale | IEEE SC22 |
| 2022 | Libraries | Alpa: Automating Inter- and Intra-Operator Parallelism for Distributed Deep Learning | OSDI |
| 2023 | Libraries | Colossal-AI: A Unified Deep Learning System For Large-Scale Parallel Training | ICPP |
| 2023 | Libraries | GPT-NeoX-20B: An Open-Source Autoregressive Language Model | ACL |
| 2020 | Edge devices | Lite Transformer with Long-Short Range Attention | arXiv |
| 2021 | Edge devices | Generate More Features with Cheap Operations for BERT | ACL |
| 2021 | Edge devices | SqueezeBERT: What can computer vision teach NLP about efficient neural networks? | SustaiNLP |
| 2022 | Edge devices | EdgeFormer: A Parameter-Efficient Transformer for On-Device Seq2seq Generation | arXiv |
| 2022 | Edge devices | ProFormer: Towards On-Device LSH Projection-Based Transformers | ACL |
| 2023 | Edge devices | Training Large-Vocabulary Neural Language Models by Private Federated Learning for Resource-Constrained Devices | ICASSP |
| 2023 | Edge devices | Federated Fine-Tuning of LLMs on the Very Edge: The Good, the Bad, the Ugly | arXiv |
| 2023 | Edge devices | Large Language Models Empowered Autonomous Edge AI for Connected Intelligence | arXiv |

Other Systems

| Date | Keywords | Paper | Venue |
|------|----------|-------|-------|
| 2023 | Other Systems | Tabi: An Efficient Multi-Level Inference System for Large Language Models | EuroSys |
| 2023 | Other Systems | Near-Duplicate Sequence Search at Scale for Large Language Model Memorization Evaluation | PACMMOD |

Resource-Efficiency Evaluation Metrics & Benchmarks

🧮 Computation Metrics

| Metric | Description | Example Usage |
|--------|-------------|---------------|
| FLOPs (Floating-point operations) | the number of arithmetic operations on floating-point numbers | [FLOPs] |
| Training Time | the total duration required for training, typically measured in wall-clock minutes, hours, or days | [minutes, days], [hours] |
| Inference Time/Latency | the average time required to generate an output after receiving an input, typically measured in wall-clock time or CPU/GPU/TPU clock time in milliseconds or seconds | [end-to-end latency in seconds], [next token generation latency in milliseconds] |
| Throughput | the rate of output token generation or task completion, typically measured in tokens per second (TPS) or queries per second (QPS) | [tokens/s], [queries/s] |
| Speed-Up Ratio | the improvement in inference speed compared to a baseline model | [inference time speed-up], [throughput speed-up] |
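
For concreteness, the sketch below shows one straightforward way to estimate end-to-end latency and token throughput from wall-clock timing; the `generate` callable and the dummy generator are hypothetical stand-ins for whatever inference API is actually being profiled.

```python
import time

def measure_latency_and_throughput(generate, prompts, runs=10):
    """Average end-to-end latency (s) and throughput (tokens/s) of a generate() callable.

    `generate(prompt)` is a hypothetical stand-in that returns the generated tokens;
    swap in the inference API under evaluation.
    """
    total_time, total_tokens = 0.0, 0
    for _ in range(runs):
        for prompt in prompts:
            start = time.perf_counter()          # wall-clock timer
            output_tokens = generate(prompt)     # end-to-end generation call
            total_time += time.perf_counter() - start
            total_tokens += len(output_tokens)
    latency = total_time / (runs * len(prompts))  # seconds per request
    throughput = total_tokens / total_time        # output tokens per second
    return latency, throughput

# Dummy generator pretending to emit 32 tokens, just to make the sketch runnable.
lat, tps = measure_latency_and_throughput(lambda p: list(range(32)), ["hello world"])
print(f"latency: {lat * 1000:.2f} ms, throughput: {tps:.1f} tokens/s")
```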

💾 Memory Metrics

| Metric | Description | Example Usage |
|--------|-------------|---------------|
| Number of Parameters | the number of adjustable variables in the LLM’s neural network | [number of parameters] |
| Model Size | the storage space required for storing the entire model | [peak memory usage in GB] |
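
As an illustration, parameter count and parameter storage size can be read off a PyTorch module as sketched below; peak memory at runtime additionally depends on activations, KV caches, and optimizer state, which this sketch does not capture, and the toy `nn.Sequential` model is only a placeholder for a real LLM.

```python
import torch.nn as nn

def parameter_count(model: nn.Module) -> int:
    """Number of adjustable parameters in the network."""
    return sum(p.numel() for p in model.parameters())

def parameter_size_gb(model: nn.Module) -> float:
    """Storage needed for parameters and buffers, in GB (activations excluded)."""
    param_bytes = sum(p.numel() * p.element_size() for p in model.parameters())
    buffer_bytes = sum(b.numel() * b.element_size() for b in model.buffers())
    return (param_bytes + buffer_bytes) / 1e9

# Toy stand-in for a real LLM.
model = nn.Sequential(nn.Linear(1024, 4096), nn.GELU(), nn.Linear(4096, 1024))
print(f"{parameter_count(model):,} parameters, {parameter_size_gb(model):.3f} GB")
```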

⚡️ Energy Metrics

| Metric | Description | Example Usage |
|--------|-------------|---------------|
| Energy Consumption | the electrical energy consumed during the LLM’s lifecycle | [kWh] |
| Carbon Emission | the greenhouse gas emissions associated with the model’s energy usage | [kgCO2eq] |

Several software packages are available for real-time tracking of energy consumption and carbon emissions, and related tools can help predict the expected energy usage and carbon footprint before actual training or inference.
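
As one example, the sketch below uses CodeCarbon's `EmissionsTracker` to wrap a workload; this assumes the `codecarbon` package is installed, `run_training_step` is a hypothetical placeholder, and other trackers expose broadly similar start/stop interfaces.

```python
# pip install codecarbon  (assumed available; illustrative sketch, not tooling
# provided by the survey or this repository)
import time
from codecarbon import EmissionsTracker

def run_training_step():
    """Hypothetical placeholder for the real training or inference workload."""
    time.sleep(1.0)

tracker = EmissionsTracker(project_name="llm-demo")  # estimates energy use and emissions
tracker.start()
try:
    run_training_step()
finally:
    emissions_kg = tracker.stop()  # estimated emissions over the tracked span, in kgCO2eq
print(f"Estimated emissions: {emissions_kg:.6f} kgCO2eq")
```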

💵 Financial Cost Metric

| Metric | Description | Example Usage |
|--------|-------------|---------------|
| Dollars per parameter | the total cost of training (or running) the LLM divided by its number of parameters | |
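
A toy illustration of the arithmetic, with made-up numbers that are not taken from the survey:

```python
# Made-up figures purely to illustrate the definition of the metric.
total_cost_usd = 10_000_000      # assumed total training bill in dollars
num_parameters = 10e9            # assumed model size: 10B parameters

dollars_per_parameter = total_cost_usd / num_parameters
print(f"{dollars_per_parameter:.1e} $/parameter")   # 1.0e-03 $/parameter
```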

📨 Network Communication Metric

| Metric | Description | Example Usage |
|--------|-------------|---------------|
| Communication Volume | the total amount of data transmitted across the network during a specific LLM execution or training run | [communication volume in TB] |
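
As a rough back-of-the-envelope sketch (not a tool from the papers above), the per-step gradient all-reduce volume in data-parallel training can be estimated from the parameter count; the 2(N-1)/N factor is the standard ring all-reduce cost, while the byte width and worker count below are assumptions.

```python
def ring_allreduce_gb_per_step(num_params: float, bytes_per_grad: int = 2,
                               num_workers: int = 8) -> float:
    """Estimated data each worker sends per step to all-reduce its gradients.

    Assumes 2-byte (fp16/bf16) gradients and the standard ring all-reduce cost of
    2 * (N - 1) / N * message_size per participant; real traffic depends on the
    framework, topology, and any gradient compression in use.
    """
    message_bytes = num_params * bytes_per_grad
    return 2 * (num_workers - 1) / num_workers * message_bytes / 1e9

# Example: a 7B-parameter model trained with 8 data-parallel workers.
print(f"{ring_allreduce_gb_per_step(7e9):.1f} GB sent per worker per step")
```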

💡 Other Metrics

| Metric | Description | Example Usage |
|--------|-------------|---------------|
| Compression Ratio | the reduction in size of the compressed model compared to the original model | [compress rate], [percentage of weights remaining] |
| Loyalty/Fidelity | the resemblance between the teacher and student models in terms of prediction consistency and alignment of predicted probability distributions | [loyalty], [fidelity] |
| Robustness | the resistance to adversarial attacks, where slight input modifications can potentially manipulate the model's output | [after-attack accuracy, query number] |
| Pareto Optimality | the optimal trade-offs between various competing factors | [Pareto frontier (cost and accuracy)], [Pareto frontier (performance and FLOPs)] |
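
To make the compression-ratio and loyalty/fidelity rows concrete, the sketch below shows one generic way these quantities can be computed from teacher and student outputs; it is illustrative only and not tied to any specific paper in the list.

```python
import numpy as np

def compression_ratio(original_bytes: int, compressed_bytes: int) -> float:
    """Size reduction of the compressed model relative to the original."""
    return original_bytes / compressed_bytes

def _softmax(logits: np.ndarray) -> np.ndarray:
    z = logits - logits.max(-1, keepdims=True)
    e = np.exp(z)
    return e / e.sum(-1, keepdims=True)

def loyalty(teacher_logits: np.ndarray, student_logits: np.ndarray) -> float:
    """Fraction of inputs where teacher and student predict the same label."""
    return float(np.mean(teacher_logits.argmax(-1) == student_logits.argmax(-1)))

def fidelity_kl(teacher_logits: np.ndarray, student_logits: np.ndarray) -> float:
    """Mean KL(teacher || student) between predicted distributions (lower = closer)."""
    p, q = _softmax(teacher_logits), _softmax(student_logits)
    return float(np.mean((p * (np.log(p + 1e-12) - np.log(q + 1e-12))).sum(-1)))

# Toy example on random logits for a 3-class task.
rng = np.random.default_rng(0)
t, s = rng.normal(size=(100, 3)), rng.normal(size=(100, 3))
print(compression_ratio(10_000_000, 2_500_000), loyalty(t, s), fidelity_kl(t, s))
```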

Benchmarks

| Benchmark | Description | Paper |
|-----------|-------------|-------|
| General NLP Benchmarks | an extensive collection of general NLP benchmarks such as GLUE, SuperGLUE, WMT, and SQuAD | A Comprehensive Overview of Large Language Models |
| Dynaboard | an open-source platform for evaluating NLP models in the cloud, offering real-time interaction and a holistic assessment of model quality with a customizable Dynascore | Dynaboard: An Evaluation-As-A-Service Platform for Holistic Next-Generation Benchmarking |
| EfficientQA | an open-domain Question Answering (QA) challenge at NeurIPS 2020 that focuses on building accurate, memory-efficient QA systems | NeurIPS 2020 EfficientQA Competition: Systems, Analyses and Lessons Learned |
| SustaiNLP 2020 Shared Task | a challenge for developing energy-efficient NLP models, assessing their performance across eight NLU tasks using SuperGLUE metrics and evaluating their energy consumption during inference | Overview of the SustaiNLP 2020 Shared Task |
| ELUE (Efficient Language Understanding Evaluation) | a benchmark platform for evaluating NLP model efficiency across various tasks, offering online metrics and requiring only a Python model definition file for submission | Towards Efficient NLP: A Standard Evaluation and A Strong Baseline |
| VLUE (Vision-Language Understanding Evaluation) | a comprehensive benchmark for assessing vision-language models across multiple tasks, offering an online platform for evaluation and comparison | VLUE: A Multi-Task Benchmark for Evaluating Vision-Language Models |
| Long Range Arena (LRA) | a benchmark suite evaluating efficient Transformer models on long-context tasks, spanning diverse modalities and reasoning types while allowing evaluations under controlled resource constraints, highlighting real-world efficiency | Long Range Arena: A Benchmark for Efficient Transformers |
| Efficiency-aware MS MARCO | an enhanced MS MARCO information retrieval benchmark that integrates efficiency metrics like per-query latency and cost alongside accuracy, facilitating a comprehensive evaluation of IR systems | Moving Beyond Downstream Task Accuracy for Information Retrieval Benchmarking |

Reference

If you find this paper list useful in your research, please consider citing:

@article{bai2024beyond,
  title={Beyond Efficiency: A Systematic Survey of Resource-Efficient Large Language Models},
  author={Bai, Guangji and Chai, Zheng and Ling, Chen and Wang, Shiyu and Lu, Jiaying and Zhang, Nan and Shi, Tingwei and Yu, Ziyang and Zhu, Mengdan and Zhang, Yifei and others},
  journal={arXiv preprint arXiv:2401.00625},
  year={2024}
}
