🌟 Awesome AI Efficiency 🌟

Awesome MIT License

A curated list of resources dedicated to enhancing efficiency in AI systems. This repository covers a wide range of topics essential for optimizing AI models and processes, aiming to make AI faster, cheaper, smaller, and greener!

Topics Summary 🎨

| Topic | Description | Tag |
|---|---|---|
| Quantization | Reducing the numerical precision of model weights and activations with minimal quality loss (see the sketch after this table) | Quantization |
| Pruning | Removing unnecessary model parameters for efficiency | Pruning |
| Caching | Storing computation results for faster reuse | Caching |
| Distillation | Transferring knowledge from a large model to a smaller one | Distillation |
| Factorization | Breaking down complex models into simpler, efficient components | Factorization |
| Compilation | Optimizing model code for specific hardware and environments | Compilation |
| Parameter-Efficient Fine-tuning | Fine-tuning only a small subset of model parameters | PEFT |
| Speculative Decoding | Accelerating generation by drafting tokens with a small model and verifying them with the large one | SpecDec |
| Hardware | Leveraging specialized hardware for faster model execution | Hardware |
| Training | Techniques for making model training faster and more efficient | Training |
| Inference | Optimizing the speed and resource usage during model inference | Inference |
| Sustainability | Strategies to reduce the environmental impact of AI systems | Sustainability |
| Scalability | Approaches for scaling AI models and infrastructure efficiently | Scalability |
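
As a concrete illustration of the first row above, here is a minimal sketch of post-training dynamic quantization using PyTorch (an assumed dependency, chosen only for illustration): the weights of selected layers are stored in int8 and dequantized on the fly at inference time.

```python
import torch
import torch.nn as nn

# Toy model standing in for a larger network.
model = nn.Sequential(nn.Linear(512, 512), nn.ReLU(), nn.Linear(512, 10))

# Post-training dynamic quantization: Linear weights are stored in int8,
# shrinking the model and typically speeding up CPU inference.
quantized = torch.ao.quantization.quantize_dynamic(
    model, {nn.Linear}, dtype=torch.qint8
)

x = torch.randn(1, 512)
print(quantized(x).shape)  # same interface as the original model
```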

If you find this list helpful, give it a ⭐ on GitHub, share it, and contribute by submitting a pull request or issue!


Table of Contents

  • Facts 📊
  • Tools 🛠️
  • Articles 📰
  • Reports 📈
  • Research Articles 📄
  • Blogs 📰
  • Books 📚
  • Lectures 🎓
  • People 🧑‍💻
  • Organizations 🌍
  • Contributing 🤝
  • License 📄

Facts 📊

  • 3-40Wh: Energy consumed by a single ChatGPT query, from a short prompt to a long one (Source, 2025)
  • 1L: Estimated amount of water required for 20-100 ChatGPT queries (Source, 2025)
  • 2 nuclear plants: Number of nuclear plants that would need to run constantly to generate enough energy if 80M people each generated 5 pages per day (Source, 2025)
  • 1 smartphone charge: Energy required to generate a couple of images with AI or to run a few thousand LLM inferences (Source, 2024)
  • >10s: Time required to generate 1 HD image with Flux on an H100, or 100 tokens with Llama 3 on a T4 (Source and Source, 2024)
  • 61,848.0x: Factor separating the highest and lowest energy use on an energy leaderboard for AI models (Source, 2025)
  • 1,300MWh: Estimated electricity use of GPT-3, just under 1,300 megawatt-hours, about as much as 130 US homes consume in a year (Source, 2024)
  • 800M users/week: Number of weekly ChatGPT users in 2025 (Source)
  • 1B messages/day: Number of ChatGPT queries per day in 2025 (Source) (see the back-of-envelope sketch after this list)
  • +160%: Expected increase in data center power consumption by 2030 (Source)
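
To make these per-query figures concrete, below is a back-of-envelope sketch in Python that combines two numbers from the list above (3-40 Wh per query and roughly 1B messages per day); it only illustrates the arithmetic and is not an independent estimate.

```python
# Back-of-envelope arithmetic using two figures from the list above:
# 3-40 Wh per ChatGPT query and ~1B messages per day.

WH_PER_QUERY_RANGE = (3.0, 40.0)      # Wh per query (low/high from the list)
QUERIES_PER_DAY = 1_000_000_000       # ~1B messages per day (from the list)

def daily_energy_gwh(wh_per_query: float, queries_per_day: int) -> float:
    """Total daily energy in GWh for a given per-query energy in Wh."""
    return wh_per_query * queries_per_day / 1e9  # 1 GWh = 1e9 Wh

low, high = (daily_energy_gwh(wh, QUERIES_PER_DAY) for wh in WH_PER_QUERY_RANGE)
print(f"~{low:.0f}-{high:.0f} GWh per day")  # prints: ~3-40 GWh per day
```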

Tools 🛠️

  • ❤️ Pruna ❤️: A package to make AI models faster, smaller, cheaper, and greener by combining compression methods (incl. quantization, pruning, caching, compilation, distillation...) on various hardware.
  • TensorRT: High-performance deep learning inference library for NVIDIA GPUs.
  • ONNX: Open Neural Network Exchange format for interoperability among deep learning frameworks.
  • Code Carbon: Library to track the energy and carbon efficiency of various hardware (usage sketch after this list).
  • LLM Perf: A framework for benchmarking the performance of Transformers models across different hardware, backends, and optimizations.
  • ML.ENERGY Leaderboard: An initiative to benchmark the energy efficiency of AI models.
  • AI Energy Score: An initiative to establish comparable energy efficiency ratings for AI models, helping the industry make informed decisions about sustainability in AI development.
  • Model Optimization Toolkit: TensorFlow toolkit for optimizing machine learning models for deployment and execution.
  • Green Coding: An LLM service that lets you prompt most open-source models and see the resulting resource usage.
  • EcoLogits: A Python library that tracks the energy consumption and environmental footprint of using generative AI models through APIs.
  • Perplexity Kernels: GPU kernels by Perplexity.
  • Fast Tokenizer: An efficient and optimized tokenizer engine for LLM inference serving.
  • WeightWatcher: An open-source diagnostic tool for analyzing deep neural networks (DNNs) without needing access to training or even test data.
  • Cockpit: A practical debugging tool for training deep neural networks.
  • Electricity Map: A live map showing the origin of electricity across world regions and its CO2 intensity.
  • MLCA: A tool for machine learning life cycle assessment.
  • TritonParse: A visualization and analysis tool for Triton IR files, designed to help developers analyze, debug, and understand Triton kernel compilation processes.
  • Routing on Random Forests: A framework for training and serving random-forest-based LLM routers, enabling cost optimization.
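
To show how this kind of measurement tooling is typically wired into a workload, here is a minimal sketch using Code Carbon's EmissionsTracker (assuming `pip install codecarbon`); treat it as an illustrative usage pattern and refer to the project's documentation for the authoritative API.

```python
from codecarbon import EmissionsTracker

def workload() -> int:
    # Stand-in for the training or inference code you want to measure.
    return sum(i * i for i in range(10_000_000))

tracker = EmissionsTracker(project_name="efficiency-demo")
tracker.start()
try:
    workload()
finally:
    # stop() returns the estimated emissions (kg CO2-eq) for the tracked code.
    emissions_kg = tracker.stop()

print(f"Estimated emissions: {emissions_kg:.6f} kg CO2-eq")
```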

Articles 📰


Reports 📈


Research Articles 📄

| Paper | Year | Venue | Tags |
|---|---|---|---|
| Mirage: A Multi-Level Superoptimizer for Tensor Programs | 2025 | None | |
| The ML.ENERGY Benchmark: Toward Automated Inference Energy Measurement and Optimization | 2025 | None | Sustainability |
| AB-Cache: Training-Free Acceleration of Diffusion Models via Adams-Bashforth Cached Feature Reuse | 2025 | None | Caching |
| Hardware-Efficient Attention for Fast Decoding | 2025 | None | Hardware |
| Model-Preserving Adaptive Rounding | 2025 | None | Quantization |
| Frugal AI: Introduction, Concepts, Development and Open Questions | 2025 | None | Sustainability |
| Making AI Less “Thirsty”: Uncovering and Addressing the Secret Water Footprint of AI Models | 2025 | None | Sustainability |
| Efficient Time Series Processing for Transformers and State-Space Models through Token Merging | 2025 | None | |
| A Survey on Inference Engines for Large Language Models: Perspectives on Optimization and Efficiency | 2025 | None | Inference |
| SpargeAttn: Accurate Sparse Attention Accelerating Any Model Inference | 2025 | None | Inference |
| s1: Simple test-time scaling | 2025 | None | Inference |
| BitNet b1.58 2B4T Technical Report | 2025 | None | Quantization |
| NdLinear Is All You Need for Representation Learning | 2025 | None | Factorization |
| LoRI: Reducing Cross-Task Interference in Multi-Task Low-Rank Adaptation | 2025 | ICLR | PEFT |
| FISH-Tuning: Enhancing PEFT Methods with Fisher Information | 2025 | None | PEFT |
| Green Prompting | 2025 | None | |
| Compression Scaling Laws: Unifying Sparsity and Quantization | 2025 | None | Pruning, Quantization |
| FasterCache: Training-Free Video Diffusion Model Acceleration with High Quality | 2025 | ICLR | Caching |
| LANTERN: Accelerating Visual Autoregressive Models with Relaxed Speculative Decoding | 2025 | ICLR | SpecDec |
| Cache Me If You Must: Adaptive Key-Value Quantization for Large Language Models | 2025 | None | Quantization |
| Real-Time Video Generation with Pyramid Attention Broadcast | 2025 | ICLR | Caching |
| Not All Prompts Are Made Equal: Prompt-based Pruning of Text-to-Image Diffusion Models | 2025 | ICLR | Pruning |
| Probe Pruning: Accelerating LLMs through Dynamic Pruning via Model-Probing | 2025 | ICLR | Pruning |
| Native Sparse Attention: Hardware-Aligned and Natively Trainable Sparse Attention | 2025 | None | |
| FlexiDiT: Your Diffusion Transformer Can Easily Generate High-Quality Samples with Less Compute | 2025 | None | |
| Can 1B LLM Surpass 405B LLM? Rethinking Compute-Optimal Test-Time Scaling | 2025 | None | Inference |
| SpinQuant: LLM Quantization with Learned Rotations | 2025 | ICLR | Quantization |
| Inference-Time Scaling for Diffusion Models beyond Scaling Denoising Steps | 2025 | None | Inference |
| Distillation Scaling Laws | 2025 | None | Distillation |
| QuEST: Stable Training of LLMs with 1-Bit Weights and Activations | 2025 | None | Quantization |
| From Efficiency Gains to Rebound Effects: The Problem of Jevons' Paradox in AI's Polarized Environmental Debate | 2025 | None | |
| Coca4ai: checking energy behaviors on AI data centers | 2024 | None | Sustainability, Scalability |
| How Green Can AI Be? A Study of Trends in Machine Learning Environmental Impacts | 2024 | None | Sustainability |
| QuaRot: Outlier-Free 4-Bit Inference in Rotated LLMs | 2024 | NeurIPS | Quantization |
| The Iterative Optimal Brain Surgeon: Faster Sparse Recovery by Leveraging Second-Order Information | 2024 | NeurIPS | |
| Palu: Compressing KV-Cache with Low-Rank Projection | 2024 | None | Quantization |
| AWQ: Activation-aware Weight Quantization for LLM Compression and Acceleration | 2024 | MLSys | Quantization |
| LOFIT: Localized Fine-tuning on LLM Representations | 2024 | NeurIPS | PEFT |
| Outlier Weighed Layerwise Sparsity: A Missing Secret Sauce for Pruning LLMs to High Sparsity | 2024 | ICML | Pruning |
| QuIP#: Even Better LLM Quantization with Hadamard Incoherence and Lattice Codebooks | 2024 | ICML | Quantization |
| QTIP: Quantization with Trellises and Incoherence Processing | 2024 | NeurIPS | Quantization |
| VPTQ: Extreme Low-bit Vector Post-Training Quantization for Large Language Models | 2024 | EMNLP | Quantization |
| QServe: W4A8KV4 Quantization and System Co-design for Efficient LLM Serving | 2024 | None | Quantization |
| Extreme Compression of Large Language Models via Additive Quantization | 2024 | ICML | Quantization |
| Fast Matrix Multiplications for Lookup Table-Quantized LLMs | 2024 | None | Quantization |
| GPTVQ: The Blessing of Dimensionality for LLM Quantization | 2024 | None | Quantization |
| Parameter-Efficient Fine-Tuning for Large Models: A Comprehensive Survey | 2024 | None | PEFT |
| SWIFT: On-the-Fly Self-Speculative Decoding for LLM Inference Acceleration | 2024 | None | SpecDec |
| SpecExec: Massively Parallel Speculative Decoding for Interactive LLM Inference on Consumer Devices | 2024 | NeurIPS | SpecDec |
| ShortGPT: Layers in Large Language Models are More Redundant Than You Expect (https://arxiv.org/pdf/2403.03853) | 2024 | None | Pruning |
| Canvas: End-to-End Kernel Architecture Search in Neural Networks | 2024 | None | Compilation |
| Scaling Laws for Precision | 2024 | None | Quantization |
| DeepCache: Accelerating Diffusion Models for Free | 2024 | CVPR | Caching |
| Unlocking Efficiency in Large Language Model Inference: A Comprehensive Survey of Speculative Decoding | 2024 | ACL | Distillation |
| Power Hungry Processing: Watts Driving the Cost of AI Deployment? | 2024 | FAccT | |
| Decoding Compressed Trust: Scrutinizing the Trustworthiness of Efficient LLMs Under Compression | 2024 | ICML | Pruning, Quantization |
| Pushing the Limits of Large Language Model Quantization via the Linearity Theorem | 2024 | None | Quantization |
| Position: Tensor Networks are a Valuable Asset for Green AI | 2024 | None | Factorization |
| Hype, Sustainability, and the Price of the Bigger-is-Better Paradigm in AI | 2024 | None | Sustainability |
| Everybody Prune Now: Structured Pruning of LLMs with only Forward Passes | 2024 | ICLR | Pruning |
| Efficient Memory Management for Large Language Model Serving with PagedAttention | 2023 | SOSP | Caching |
| Broken Neural Scaling Laws | 2023 | ICLR | |
| Post Training Mixed Precision Quantization of Neural Networks using First-Order Information | 2023 | ICCV | Quantization |
| Ring Attention with Blockwise Transformers for Near-Infinite Context | 2023 | None | |
| A Practical Mixed Precision Algorithm for Post-Training Quantization | 2023 | None | Quantization |
| SmoothQuant: Accurate and Efficient Post-Training Quantization for Large Language Models | 2023 | ICML | Quantization |
| PERP: Rethinking the Prune-Retrain Paradigm in the Era of LLMs | 2023 | None | PEFT, Pruning |
| Trends in AI inference energy consumption: Beyond the performance-vs-parameter laws of deep learning | 2023 | Sustainable Computing: Informatics and Systems | Sustainability |
| An experimental comparison of software-based power meters: focus on CPU and GPU | 2023 | CCGrid | Hardware |
| Fast Inference from Transformers via Speculative Decoding | 2023 | ICML | Caching |
| Efficient Streaming Language Models with Attention Sinks | 2023 | ICLR | |
| GPTQ: Accurate Post-Training Quantization for Generative Pre-trained Transformers | 2023 | None | Quantization |
| Mixed-Precision Neural Network Quantization via Learned Layer-wise Importance | 2022 | ECCV | Quantization |
| Knowledge Distillation: A Good Teacher is Patient and Consistent | 2022 | CVPR | Distillation |
| LoRA: Low-Rank Adaptation of Large Language Models | 2022 | ICLR | PEFT |
| LLM.int8(): 8-bit Matrix Multiplication for Transformers at Scale | 2022 | NeurIPS | Quantization |
| Optimal Clipping and Magnitude-aware Differentiation for Improved Quantization-aware Training | 2022 | ICML | Quantization |
| Sustainable AI: Environmental Implications, Challenges and Opportunities | 2022 | None | Sustainability |
| Learnable Lookup Table for Neural Network Quantization | 2022 | CVPR | Quantization |
| Training Compute-Optimal Large Language Models | 2022 | None | |
| FlashAttention: Fast and Memory-Efficient Exact Attention with IO-Awareness | 2022 | None | |
| Towards a Unified View of Parameter-Efficient Transfer Learning | 2022 | ICLR | PEFT |
| Parameter-Efficient Transfer Learning with Diff Pruning | 2021 | ACL | PEFT, Pruning |
| What is the State of Neural Network Pruning? | 2020 | MLSys | Pruning |
| Scaling Laws for Autoregressive Generative Modeling | 2020 | None | |
| Model Compression via Distillation and Quantization | 2018 | ICLR | Quantization |
| Optimal Brain Damage | 1989 | NeurIPS | Pruning |

Blogs 📰


Books 📚


Lectures 🎓

  • AI Efficiency Courses: Slides, Exercises (2025) - Bertrand Charpentier
  • Data Compression, Theory and Applications: YouTube, Slides (2024) - Stanford
  • MIT HAN Lab (2024) - MIT lectures by Song Han's lab
  • GPU Mode (2020) - Tutorials by the GPU Mode community

People 🧑‍💻

| Name | Affiliation | Research Interests | Social Media |
|---|---|---|---|
| James Martin | Better Tech | AI Sustainability | LinkedIn |
| Saleh Ashkboos | ETH Zurich | Quantization | LinkedIn |
| Dan Alistarh | IST Austria | AI Compression | LinkedIn |
| Elias Frantar | OpenAI | Quantization | LinkedIn |
| Tim Dettmers | CMU | Quantization | LinkedIn |
| Song Han | MIT | AI Efficiency | LinkedIn |
| Scott Chamberlin | TBD | AI Efficiency | LinkedIn |
| Benoit Petit | Boavista | Data Center Efficiency | LinkedIn |
| Samuel Rincé | Gen AI Impact | AI Efficiency, Sustainability | LinkedIn |
| Théo Alves Da Costa | Ekimetrics | AI Efficiency, Sustainability | LinkedIn |
| Sasha Luccioni | Hugging Face | AI Sustainability | LinkedIn |
| Anne-Laure Ligozat | ENSIIE | AI Sustainability | LinkedIn |

Organizations 🌍

| Organization | Description | Website |
|---|---|---|
| Data4Good | A platform that connects data scientists with social impact projects to address global challenges using data. | data4good.org |
| Make.org | A global platform that empowers citizens to propose and take action on social and environmental issues through collective projects. | make.org |
| CodeCarbon | A tool that helps track the carbon emissions of machine learning models and optimizes them for sustainability. | codecarbon.io |
| Sustainable AI Coalition | An organization dedicated to advancing sustainability in AI technologies and promoting best practices for green AI. | sustainableaicoalition.org |
| FruitPunch AI | A community that builds AI solutions for impact organizations contributing to the SDGs. | fruitpunch.ai |

Contributing 🤝

Contributions are welcome! Please follow our contribution guidelines to add new resources or suggest improvements that promote AI efficiency.


License 📄

This project is licensed under the MIT License. Feel free to share and use the resources as needed.
