Paper list for broad topics in machine learning systems
NOTE: Survey papers are annotated with the [Survey] prefix.
- Paper List for Machine Learning Systems
- Table of Contents
- 1. Data Processing
- 2. Training System
- 2.1 Empirical study on ML Jobs
- 2.2 Resource scheduling
- 2.3 GPU sharing
- 2.4 GPU memory management and optimization
- 2.5 Distributed training
- 2.6 DL job failures & resilient training
- 2.7 AutoML
- 2.8 Communication optimization & network infrastructure for ML
- 2.9 Model compression
- 2.10 DNN compiler
- 2.11 GNN training system
- 3. Inference System
- 4. Mixture of Experts (MoE)
- 5. LLM Long Context
- 6. Federated Learning
- 7. Privacy-Preserving ML
- 8. ML APIs & Application-side Optimization
- 9. ML (LLM) for Systems
- 10. GPU Kernel Scheduling & Optimization
- 11. Energy efficiency for LLM (carbon-aware)
- 12. Retrieval-Augmented Generation (RAG)
- 13. Simulation
- Others
- References
- [arxiv'25] Scalable and Performant Data Loading
- [arxiv'25] OVERLORD: Ultimate Scaling of DataLoader for Multi-Source Large Foundation Model Training
- [arxiv'25] The Streaming Batch Model for Efficient and Fault-Tolerant Heterogeneous Execution
- [arxiv'25] In-Network Preprocessing of Recommender Systems on Multi-Tenant SmartNICs
- [VLDB'25] cedar: Composable and Optimized Machine Learning Input Data Pipelines
- [HotInfra'24] Lotus: Characterize Architecture Level CPU-based Preprocessing in Machine Learning Pipelines
- [arxiv'24] TensorSocket: Shared Data Loading for Deep Learning Training
- [arxiv'24] Efficient Tabular Data Preprocessing of ML Pipelines
- [MLSys'22] Plumber: Diagnosing and Removing Performance Bottlenecks in Machine Learning Data Pipelines
- [ISCA'22] Understanding Data Storage and Ingestion for Large-Scale Deep Recommendation Model Training
- [SIGMOD'22] Where Is My Training Bottleneck? Hidden Trade-Offs in Deep Learning Preprocessing Pipelines
- [VLDB'21] Analyzing and Mitigating Data Stalls in DNN Training
- [VLDB'21] tf.data: A Machine Learning Data Processing Framework
- [arxiv'24] PREBA: A Hardware/Software Co-Design for Multi-Instance GPU based AI Inference Servers
- [ATC'24] Pecan: Cost-Efficient ML Data Preprocessing with Automatic Transformation Ordering and Hybrid Placement
- [HotStorage'24] A Selective Preprocessing Offloading Framework for Reducing Data Traffic in DL Training
- [VLDB'24] FusionFlow: Accelerating Data Preprocessing for Machine Learning with CPU-GPU Cooperation
- [arxiv'23] Rinas: Training with Dataset Shuffling Can Be General and Fast
- [CVPR'23] FFCV: Accelerating Training by Removing Data Bottlenecks
- [RecSys'23] InTune: Reinforcement Learning-based Data Pipeline Optimization for Deep Recommendation Models
- [SIGMOD'23] GoldMiner: Elastic Scaling of Training Data Pre-Processing Pipelines for Deep Learning
- [VLDB'23] FastFlow: Accelerating Deep Learning Model Training with Smart Offloading of Input Data Pipeline
- [SoCC'23] tf.data service: A Case for Disaggregating ML Input Data Processing
- [ATC'22] Cachew: Machine Learning Input Data Processing as a Service
- [OSDI'22] Looking Beyond GPUs for DNN Scheduling on Multi-Tenant Clusters
- [ICPP'19] DLBooster: Boosting End-to-End Deep Learning Workflows with Offloading Data Preprocessing Pipelines
- [TACO'23] Fastensor: Optimise the Tensor I/O Path from SSD to GPU for Deep Learning Training
- [ICPP'22] Lobster: Load Balance-Aware I/O for Distributed DNN Training
- [SC'21] Clairvoyant Prefetching for Distributed Machine Learning I/O
- [VLDB'25] Eliminating Data Processing Bottlenecks in GNN Training over Large Graphs via Two-level Feature Compression
- [ISCA'24] PreSto: An In-Storage Data Preprocessing System for Training Recommendation Models
- [arxiv'23] Towards Data-centric Graph Machine Learning: Review and Outlook
- [arxiv'23] FlexShard: Flexible Sharding for Industry-Scale Sequence Recommendation Models
- [MLSys'23] RecD: Deduplication for End-to-End Deep Learning Recommendation Model Training Infrastructure
- [ASPLOS'22] RecShard: statistical feature-based memory optimization for industry-scale neural recommendation
- [RecSys'23] InTune: Reinforcement Learning-based Data Pipeline Optimization for Deep Recommendation Models
- [arxiv'23] MTrainS: Improving DLRM training efficiency using heterogeneous memories
- [SOSP'23] Bagpipe: Accelerating Deep Recommendation Model Training
- [SOSP'23] gSampler: General and Efficient GPU-based Graph Sampling for Graph Learning
- [NSDI'23] BGL: GPU-Efficient GNN Training by Optimizing Graph Data I/O and Preprocessing
- [DAC'22] A Joint Management Middleware to Improve Training Performance of Deep Recommendation Systems with SSDs
- [VLDB'22] Accelerating Recommendation System Training by Leveraging Popular Choices
- [ICDE'25] MLKV: Efficiently Scaling up Large Embedding Model Training with Disk-based Key-Value Storage
- [TPDS'23] High-Level Data Abstraction and Elastic Data Caching for Data-Intensive AI Applications on Cloud-Native Platforms
- [SOSP'23] UGACHE: A Unified GPU Cache for Embedding-based Deep Learning
- [ATC'23] Tectonic-Shift: A Composite Storage Fabric for Large-Scale ML Training
- [EuroSys'23] SiloD: A Co-design of Caching and Scheduling for Deep Learning Clusters [also in 2.2]
- [FAST'23] SHADE: Enable Fundamental Cacheability for Distributed Deep Learning Training
- [HPCA'23] iCACHE: An Importance-Sampling-Informed Cache for Accelerating I/O-Bound DNN Model Training
- [NeurIPS'22] A Deep Learning Dataloader with Shared Data Preparation
- [CLUSTER'22] Hvac: Removing I/O Bottleneck for Large-Scale Deep Learning Applications
- [ICDE'22] Fluid: Dataset Abstraction and Elastic Acceleration for Cloud-native Deep Learning Training Jobs
- [ATC'21] Refurbish Your Training Data: Reusing Partially Augmented Samples for Faster Deep Neural Network Training
- [FAST'20] Quiver: An Informed Storage Cache for Deep Learning
- [ICPP'20] DIESEL: A Dataset-Based Distributed Storage and Caching System for Large-Scale Deep Learning Training
- [arxiv'19] Faster Neural Network Training with Data Echoing
- [HotCloud'19] The Case for Unifying Data Loading in Machine Learning Clusters
- [ECCV'22] L3: Accelerator-Friendly Lossless Image Format for High-Resolution, High-Throughput DNN Training
- [VLDB'21] Progressive compressed records: Taking a byte out of deep learning data
- [CIDR'21] Lightweight Inspection of Data Preprocessing in Native Machine Learning Pipelines
- [VLDB'18] Snorkel: Rapid Training Data Creation with Weak Supervision
- [ICDE'25] Training Data Distribution Estimation for Optimized Pre-Training Data Management
- [arxiv'25] Mixtera: A Data Plane for Foundation Model Training
- [ICSE'24] An Empirical Study on Low GPU Utilization of Deep Learning Jobs
- [NSDI'24] Characterization of Large Language Model Development in the Datacenter
- [NSDI'22] MLaaS in the wild: workload analysis and scheduling in large-scale heterogeneous GPU clusters (PAI)
- [ATC'19] Analysis of Large-Scale Multi-Tenant GPU Clusters for DNN Training Workloads (Philly)
- [arxiv'24] Zeal: Rethinking Large-Scale Resource Allocation with "Decouple and Decompose"
- [EuroSys'25] Eva: Cost-Efficient Cloud-Based Cluster Scheduling
- [arxiv'25] TAPAS: Thermal- and Power-Aware Scheduling for LLM Inference in Cloud Platforms
- [TACO'24] Taming Flexible Job Packing in Deep Learning Training Clusters
- [SoCC'24] Kale: Elastic GPU Scheduling for Online DL Model Training
- [arxiv'24] Rubick: Exploiting Job Reconfigurability for Deep Learning Cluster Scheduling
- [SC'24] PAL: A Variability-Aware Policy for Scheduling ML Workloads in GPU Clusters
- [OSDI'24] MAST: Global Scheduling of ML Training across Geo-Distributed Datacenters at Hyperscale
- [ASPLOS'24] Heet: Accelerating Elastic Training in Heterogeneous Deep Learning Clusters
- [Middleware'24] Optimal Resource Efficiency with Fairness in Heterogeneous GPU Clusters
- [IPDPS'24] Hadar: Heterogeneity-Aware Optimization-Based Online Scheduling for Deep Learning Cluster
- [EuroSys'24] Blox: A Modular Toolkit for Deep Learning Schedulers
- [NSDI'24] Swing: Short-cutting Rings for Higher Bandwidth Allreduce
- [NSDI'24] Towards Domain-Specific Network Transport for Distributed DNN Training
- [NSDI'24] Vulcan: Automatic Query Planning for Live ML Analytics
- [NSDI'24] CASSINI: Network-Aware Job Scheduling in Machine Learning Clusters
- [Survey] [ACM CSUR'23] Deep Learning Workload Scheduling in GPU Datacenters: A Survey
- [arxiv'23] Energy-Efficient GPU Clusters Scheduling for Deep Learning
- [SC'23] EasyScale: Accuracy-consistent Elastic Training for Deep Learning
- [ICPP'23] CoTrain: Efficient Scheduling for Large-Model Training upon GPU and CPU in Parallel
- [ICPP'23] Embracing Uncertainty for Equity in Resource Allocation in ML Training
- [SOSP'23] Sia: Heterogeneity-aware, goodput-optimized ML-cluster scheduling
- [NSDI'23] Shockwave: Proactive, Fair, and Efficient Cluster Scheduling for Dynamic Adaptation in Machine Learning
- [EuroSys'23] SiloD: A Co-design of Caching and Scheduling for Deep Learning Clusters [also in 1.2]
- [EuroSys'23] Lyra: Elastic Scheduling for Deep Learning Clusters
- [EuroSys'23] ElasticFlow: An Elastic Serverless Training Platform for Distributed Deep Learning
- [ASPLOS'23] Lucid: A Non-intrusive, Scalable and Interpretable Scheduler for Deep Learning Training Jobs
- [arxiv'22] Singularity: Planet-Scale, Preemptive and Elastic Scheduling of AI Workloads
- [Survey] [arxiv'22] Deep Learning Workload Scheduling in GPU Datacenters: Taxonomy, Challenges and Vision
- [SoCC'22] ESCHER: Expressive Scheduling with Ephemeral Resources
- [NSDI'22] MLaaS in the wild: workload analysis and scheduling in large-scale heterogeneous GPU clusters (PAI)
- [OSDI'22] Looking Beyond GPUs for DNN Scheduling on Multi-Tenant Clusters (Synergy)
- [SIGCOMM'22] Multi-resource interleaving for deep learning training (Muri)
- [MLSys'21] Wavelet: Efficient DNN Training with Tick-Tock Scheduling
- [SoCC'21] Chronus: A Novel Deadline-aware Scheduler for Deep Learning Training Jobs
- [SC'21] Characterization and Prediction of Deep Learning Workloads in Large-Scale GPU Datacenters (Helios)
- [OSDI'21] Privacy Budget Scheduling (DPF)
- [NSDI'21] Elastic Resource Sharing for Distributed Deep Learning (AFS)
- [OSDI'21] Pollux: Co-adaptive Cluster Scheduling for Goodput-Optimized Deep Learning
- [EuroSys'20] Balancing efficiency and fairness in heterogeneous GPU clusters for deep learning (GandivaFair)
- [NSDI'20] Themis: Fair and Efficient GPU Cluster Scheduling
- [OSDI'20] HiveD: Sharing a GPU Cluster for Deep Learning with Guarantees
- [OSDI'20] Heterogeneity-Aware Cluster Scheduling Policies for Deep Learning Workloads (Gavel)
- [EuroSys'20] AlloX: Compute Allocation in Hybrid Clusters
- [MLSys'20] Resource Elasticity in Distributed Deep Learning
- [NSDI'19] Tiresias: A GPU Cluster Manager for Distributed Deep Learning
- [ATC'19] Analysis of Large-Scale Multi-Tenant GPU Clusters for DNN Training Workloads (Philly)
- [EuroSys'18] Optimus: an efficient dynamic resource scheduler for deep learning clusters
- [OSDI'18] Gandiva: Introspective Cluster Scheduling for Deep Learning
- [arxiv'25] Prism: Unleashing GPU Sharing for Cost-Efficient Multi-LLM Serving
- [EuroSys'25] Improving GPU Sharing Performance through Adaptive Bubbleless Spatial-Temporal Sharing
- [arxiv'24] PREBA: A Hardware/Software Co-Design for Multi-Instance GPU based AI Inference Servers
- [SC'24] ParvaGPU: Efficient Spatial GPU Sharing for Large-Scale DNN Inference in Cloud Environments
- [arxiv'24] Tally: Non-Intrusive Performance Isolation for Concurrent Deep Learning Workloads
- [arxiv'24] Missile: Fine-Grained, Hardware-Level GPU Resource Isolation for Multi-Tenant DNN Inference
- [ICPP'24] MIGER: Integrating Multi-Instance GPU and Multi-Process Service for Deep Learning Clusters
- [ASPLOS'24] RAP: Resource-aware Automated GPU Sharing for Multi-GPU Recommendation Model Training and Input Preprocessing
- [EuroSys'24] Orion: Interference-aware, Fine-grained GPU Sharing for ML Applications
- [ATC'23] Beware of Fragmentation: Scheduling GPU-Sharing Workloads with Fragmentation Gradient Descent
- [NSDI'23] Transparent GPU Sharing in Container Clouds for Deep Learning Workloads
- [ICPP'23] FaST-GShare: Enabling Efficient Spatio-Temporal GPU Sharing in Serverless Computing for Deep Learning Inference
- [arxiv'23] GACER: Granularity-Aware ConcurrEncy Regulation for Multi-Tenant Deep Learning
- [arxiv'23] MuxFlow: Efficient and Safe GPU Sharing in Large-Scale Production Deep Learning Clusters
- [SoCC'22] MISO: exploiting multi-instance GPU capability on multi-tenant GPU clusters
- [PACT'22] GPUPool: A Holistic Approach to Fine-Grained GPU Sharing in the Cloud
- [ATC'21] Zico: Efficient GPU Memory Sharing for Concurrent DNN Training
- [MLSys'20] Salus: Fine-Grained GPU Sharing Primitives for Deep Learning Applications
- [OSDI'20] AntMan: Dynamic Scaling on GPU Clusters for Deep Learning
- [OSDI'20] PipeSwitch: Fast Pipelined Context Switching for Deep Learning Applications
- [RTAS'19] Fractional GPUs: Software-Based Compute and Memory Bandwidth Reservation for GPUs
- [EuroSys'25] MEPipe: Democratizing LLM Training with Memory-Efficient Slice-Level Pipeline Scheduling on Cost-Effective Accelerators
- [EuroSys'25] Mist: Efficient Distributed Training of Large Language Models via Memory-Parallelism Co-Optimization
- [FAST'25 WiP] Baton: Orchestrating GPU Memory for LLM Training on Heterogeneous Cluster
- [CGO'25] IntelliGen: Instruction-Level Auto-tuning for Tensor Program with Monotonic Memory Optimization
- [arxiv'25] Memory Analysis on the Training Course of DeepSeek Models
- [IJCAI'24] LLMem: Estimating GPU Memory Usage for Fine-Tuning Pre-Trained LLMs
- [MICRO'24] SambaNova SN40L: Scaling the AI Memory Wall with Dataflow and Composition of Experts
- [arxiv'24] Accelerating Large Language Model Training with 4D Parallelism and Memory Consumption Estimator
- [TACO'24] ATP: Achieving Throughput Peak for DNN Training via Smart GPU Memory Management
- [ICML'24] GaLore: Memory-Efficient LLM Training by Gradient Low-Rank Projection
- [ASPLOS'24] GMLake: Efficient and Transparent GPU Memory Defragmentation for Large-scale DNN Training with Virtual Memory Stitching
- [arxiv'23] Rethinking Memory and Communication Cost for Efficient Large Language Model Training
- [arxiv'23] Quantized Distributed Training of Large Models with Convergence Guarantees (QSDP)
- [arxiv'23] Does compressing activations help model parallel training?
- [SoCC'23] Towards GPU Memory Efficiency for Distributed Training at Scale
- [VLDB'23] PyTorch FSDP: Experiences on Scaling Fully Sharded Data Parallel
- [SOSP'23] Efficient Memory Management for Large Language Model Serving with PagedAttention
- [HPCA'23] MPress: Democratizing Billion-Scale Model Training on Multi-GPU Servers via Memory-Saving Inter-Operator Parallelism
- [HPCA'23] Tensor Movement Orchestration in Multi-GPU Training Systems
- [IJCAI'23] OSDP: Optimal Sharded Data Parallel for Distributed Deep Learning
- [ICLR'22] LoRA: Low-Rank Adaptation of Large Language Models
  - algorithmic method for memory efficiency
- [VLDB'22] Harmony: Overcoming the Hurdles of GPU Memory Capacity to Train Massive DNN Models on Commodity Servers
- [ATC'21] ZeRO-Offload: Democratizing Billion-Scale Model Training
- [ICML'21] ActNN: Reducing Training Memory Footprint via 2-Bit Activation Compressed Training
- [ICLR'21] Dynamic Tensor Rematerialization
- [SC'21] ZeRO-infinity: breaking the GPU memory wall for extreme scale deep learning
- [HPCA'21] Sentinel: Efficient Tensor Migration and Allocation on Heterogeneous Memory Systems for Deep Learning
- [MLSys'20] Checkmate: Breaking the Memory Wall with Optimal Tensor Rematerialization
- [ASPLOS'20] Capuchin: Tensor-based GPU Memory Management for Deep Learning
- [ASPLOS'20] SwapAdvisor: Pushing Deep Learning Beyond the GPU Memory Limit via Smart Swapping
- [ESEC/FSE'20] Estimating GPU memory consumption of deep learning models
- [SC'20] ZeRO: memory optimizations toward training trillion parameter models
- [ISCA'18] Gist: Efficient Data Encoding for Deep Neural Network Training
- [PPoPP'18] Superneurons: dynamic GPU memory management for training deep neural networks
- [MICRO'16] vDNN: Virtualized deep neural networks for scalable, memory-efficient neural network design
- [arxiv'16] Training Deep Nets with Sublinear Memory Cost
- [arxiv'25] ZenFlow: Enabling Stall-Free Offloading Training via Asynchronous Updates
- [arxiv'25] Rethinking Dynamic Networks and Heterogeneous Computing with Automatic Parallelization
- [arxiv'25] H2: Towards Efficient Large-Scale LLM Training on Hyper-Heterogeneous Cluster over 1,000 Chips
- [ISCA'25] Scaling Llama 3 Training with Efficient Parallelism Strategies
- [arxiv'25] Balanced and Elastic End-to-end Training of Dynamic LLMs
- [arxiv'25] Parallel Scaling Law for Language Models
- [OSDI'25] PipeThreader: Software-Defined Pipelining for Efficient DNN Execution
- [MLSys'25] Radius: Range-based Gradient Sparsity for Large Foundation Model Pre-training
- [arxiv'25] Sailor: Automating Distributed Training over Dynamic, Heterogeneous, and Geo-distributed Clusters
- [arxiv'25] You Don't Need All Attentions: Distributed Dynamic Fine-Tuning for Foundation Models
- [arxiv'25] WLB-LLM: Workload-Balanced 4D Parallelism for Large Language Model Training
- [arxiv'25] TorchTitan: One-stop PyTorch native solution for production ready LLM pre-training
- [arxiv'25] Nonuniform-Tensor-Parallelism: Mitigating GPU failure impact for Scaled-up LLM Training
- [ASPLOS'25] FlexSP: Accelerating Large Language Model Training via Flexible Sequence Parallelism
- [ASPLOS'25] Spindle: Efficient Distributed Training of Multi-Task Large Models via Wavefront Scheduling
- [arxiv'25] Cornstarch: Distributed Multimodal Training Must Be Multimodality-Aware
- [arxiv'25] PipeOffload: Improving Scalability of Pipeline Parallelism with Memory Optimization
- [arxiv'25] AutoHete: An Automatic and Efficient Heterogeneous Training System for LLMs
- [arxiv'25] Astra: Efficient and Money-saving Automatic Parallel Strategies Search on Heterogeneous GPUs
- [arxiv'25] Scaling Inference-Efficient Language Models
- [INFOCOM'25] Espresso: Cost-Efficient Large Model Training by Exploiting GPU Heterogeneity in the Cloud
- [arxiv'25] MiniMax-01: Scaling Foundation Models with Lightning Attention
- [ASPLOS'25] GraphPipe: Improving Performance and Scalability of DNN Training with Graph Pipeline Parallelism
- [EuroSys'25] JABAS: Joint Adaptive Batching and Automatic Scaling for DNN Training on Heterogeneous GPUs
- [arxiv'24] Automatically Planning Optimal Parallel Strategy for Large Language Models
- [arxiv'24] Frenzy: A Memory-Aware Serverless LLM Training System for Heterogeneous GPU Clusters
- [arxiv'24] Scaling Deep Learning Training with MPMD Pipeline Parallelism
- [TPDS'24] UMPIPE: Unequal Microbatches-Based Pipeline Parallelism for Deep Neural Network Training
- [arxiv'24] Demystifying Workload Imbalances in Large Transformer Model Training over Variable-length Sequences
- [arxiv'24] HETHUB: A Distributed Training System with Heterogeneous Cluster for Large-Scale Models
- [arxiv'24] Data-Centric and Heterogeneity-Adaptive Sequence Parallelism for Efficient LLM Training
- [Survey] [ACM CSUR'24] Resource-efficient Algorithms and Systems of Foundation Models: A Survey
- [SOSP'24] Uncovering Nested Data Parallelism and Data Reuse in DNN Computation with FractalTensor
- [SOSP'24] Enabling Parallelism Hot Switching for Efficient Training of Large Language Models
- [arxiv'24] Accelerating Large Language Model Training with 4D Parallelism and Memory Consumption Estimator
- [TACO'24] ATP: Achieving Throughput Peak for DNN Training via Smart GPU Memory Management
- [NeurIPS'24] Rethinking Memory and Communication Costs for Efficient Data Parallel Training of Large Language Models
- [NeurIPS'24] SpeedLoader: An I/O efficient scheme for heterogeneous and distributed LLM operation
- [SC'24] Accelerating Distributed DLRM Training with Optimized TT Decomposition and Micro-Batching
- [SC'24] Democratizing AI: Open-source Scalable LLM Training on GPU-based Supercomputers
- [arxiv'24] BitPipe: Bidirectional Interleaved Pipeline Parallelism for Accelerating Large Models Training
- [arxiv'24] Cephalo: Harnessing Heterogeneous GPU Clusters for Training Transformer Models
- [SoCC'24] Distributed training of large language models on AWS Trainium
- [arxiv'24] SimpleFSDP: Simpler Fully Sharded Data Parallel with torch.compile
- [TPDS'24] AutoDDL: Automatic Distributed Deep Learning With Near-Optimal Bandwidth Cost
- [arxiv'24] FusionLLM: A Decentralized LLM Training System on Geo-distributed GPUs with Adaptive Compression
- [arxiv'24] FALCON: Pinpointing and Mitigating Stragglers for Large-Scale Hybrid-Parallel Training
- [arxiv'24] TorchTitan: One-stop PyTorch native solution for production ready LLM pre-training
- [arxiv'24] PipeFill: Using GPUs During Bubbles in Pipeline-parallel LLM Training
- [arxiv'24] Poplar: Efficient Scaling of Distributed DNN Training on Heterogeneous GPU Clusters
- [SOSP'24] TENPLEX: Changing Resources of Deep Learning Jobs using Parallelizable Tensor Collections
- [arxiv'24] Efficient Multi-Task Large Model Training via Data Heterogeneity-aware Model Management
- [arxiv'24] FlashFlex: Accommodating Large Language Model Training over Heterogeneous Environment
- [arxiv'24] PARALLELGPUOS: A Concurrent OS-level GPU Checkpoint and Restore System using Validated Speculation
- [arxiv'24] Unicron: Economizing Self-Healing LLM Training at Scale
- [arxiv'24] TBA: Faster Large Language Model Training Using SSD-Based Activation Offloading
- [ICPP'24] AutoPipe: Automatic Configuration of Pipeline Parallelism in Shared GPU Cluster
- [arxiv'24] Optimus: Accelerating Large-Scale Multi-Modal LLM Training by Bubble Exploitation
- [Survey] [arxiv'24] Efficient Training of Large Language Models on Distributed Infrastructures: A Survey
- [COLM'24] LightSeq: Sequence Level Parallelism for Distributed Training of Long Context Transformers
- [OSDI'24] nnScaler: Constraint-Guided Parallelization Plan Generation for Deep Learning Training
- [ATC'24] Metis: Fast Automatic Distributed Training on Heterogeneous GPUs
- [ATC'24] FwdLLM: Efficient Federated Finetuning of Large Language Models with Perturbed Inferences
- [ATC'24] OPER: Optimality-Guided Embedding Table Parallelization for Large-scale Recommendation Model
- [arxiv'24] LoongTrain: Efficient Training of Long-Sequence LLMs with Head-Context Parallelism
- [arxiv'24] PAFT: A Parallel Training Paradigm for Effective LLM Fine-Tuning
- [HPDC'24] DataStates-LLM: Lazy Asynchronous Checkpointing for Large Language Models
- [ICML'24] Scaling Beyond the GPU Memory Limit for Large Mixture-of-Experts Model Training
- [ICML'24] Integrated Hardware Architecture and Device Placement Search
- [MLSys'24] DiffusionPipe: Training Large Diffusion Models with Efficient Pipelines
- [MobiCom'24] Asteroid: Resource-Efficient Hybrid Pipeline Parallelism for Collaborative DNN Training on Heterogeneous Edge Devices
- [EuroSys'24] DynaPipe: Optimizing Multi-task Training through Dynamic Pipelines
- [EuroSys'24] ScheMoE: An Extensible Mixture-of-Experts Distributed Training System with Tasks Scheduling
- [EuroMLSys@EuroSys'24] ML Training with Cloud GPU Shortages: Is Cross-Region the Answer?
- [ASPLOS'24] AdaPipe: Optimizing Pipeline Parallelism with Adaptive Recomputation and Partitioning
- [ASPLOS'24] PrimePar: Efficient Spatial-temporal Tensor Partitioning for Large Transformer Model Training
- [EuroSys'24] Aceso: Efficient Parallel DNN Training through Iterative Bottleneck Alleviation
- [arxiv'24] BurstAttention: An Efficient Distributed Attention Framework for Extremely Long Sequences
- [arxiv'24] Branch-Train-MiX: Mixing Expert LLMs into a Mixture-of-Experts LLM
- [arxiv'24] Accelerating Heterogeneous Tensor Parallelism via Flexible Workload Control
- [arxiv'24] GRAWA: Gradient-based Weighted Averaging for Distributed Training of Deep Learning Models
- [arxiv'24] BitDelta: Your Fine-Tune May Only Be Worth One Bit
- [arxiv'24] NutePrune: Efficient Progressive Pruning with Numerous Teachers for Large Language Models
- [arxiv'24] Accelerating Parallel Sampling of Diffusion Models
- [arxiv'24] Training DNN Models over Heterogeneous Clusters with Optimal Performance
- [TKDE'24] Improving Automatic Parallel Training via Balanced Memory Workload Optimization
  - extended version of Galvatron (VLDB'23)
  - arxiv version (2023)
- [NSDI'24] DISTMM: Accelerating Distributed Multi-modal Model Training
- [NSDI'24] Accelerating Neural Recommendation Training with Embedding Scheduling
- [NSDI'24] Resiliency at Scale: Managing Google's TPUv4 Machine Learning Supercomputer
- [NSDI'24] QuickUpdate: a Real-Time Personalization System for Large-Scale Recommendation Models
- [NSDI'24] Scaling Large Language Model Training to More Than 10,000 GPUs
- [arxiv'24] Breaking MLPerf Training: A Case Study on Optimizing BERT
- [ICLR'24] CO2: Efficient Distributed Training with Full Communication-Computation Overlap
- [arxiv'24] LocMoE: A Low-overhead MoE for Large Language Model Training
- [arxiv'24] Re-evaluating the Memory-balanced Pipeline Parallelism: BPipe
- [AAMAS'24] Holonic Learning: A Flexible Agent-based Distributed Machine Learning Framework
- [VLDB'24] Saturn: An Optimized Data System for Multi-Large-Model Deep Learning Workloads
- [HPCA'24] Tessel: Boosting Distributed Execution of Large DNN Models via Flexible Schedule Search
- [NSDI'24] Parcae: Proactive, Liveput-Optimized DNN Training on Preemptible Instances
- [EuroSys'24] HAP: SPMD DNN Training on Heterogeneous GPU Clusters with Automated Program Synthesis
- [ICPP'23] Mercury: Fast and Optimal Device Placement for Large Deep Learning Models
- [arxiv'23] ASPEN: High-Throughput LoRA Fine-Tuning of Large Language Models with a Single GPU
- [arxiv'23] FlexModel: A Framework for Interpretability of Distributed Large Language Models
- [arxiv'23] Holmes: Towards Distributed Training Across Clusters with Heterogeneous NIC Environment
- [arxiv'23] RTP: Rethinking Tensor Parallelism with Memory Deduplication
- [arxiv'23] FP8-LM: Training FP8 Large Language Models
- [arxiv'23] Redco: A Lightweight Tool to Automate Distributed Training of LLMs on Any GPU/TPUs
- [arxiv'23] FLM-101B: An Open LLM and How to Train It with $100K Budget
- [arxiv'23] UniAP: Unifying Inter- and Intra-Layer Automatic Parallelism by Mixed Integer Quadratic Programming
- [arxiv'23] Modeling Parallel Programs using Large Language Models
- [arxiv'23] Proteus: Simulating the Performance of Distributed DNN Training
- [arxiv'23] Automated Tensor Model Parallelism with Overlapped Communication for Efficient Foundation Model Training
- [arxiv'23] Decoupled Model Schedule for Deep Learning Training
- [arxiv'23] RAF: Holistic Compilation for Deep Learning Model Training
- [arxiv'23] Ada-Grouper: Accelerating Pipeline Parallelism in Preempted Network by Adaptive Group-Scheduling for Micro-Batches
- [arxiv'23] Does compressing activations help model parallel training?
- [arxiv'23] Colossal-Auto: Unified Automation of Parallelization and Activation Checkpoint for Large-scale Models
- [arxiv'23] Scaling Vision Transformers to 22 Billion Parameters
- [arxiv'23] Auto-Parallelizing Large Models with Rhino: A Systematic Approach on Production AI Platform
- [arxiv'23] TAP: Accelerating Large-Scale DNN Training Through Tensor Automatic Parallelisation
- [arxiv'23] SuperScaler: Supporting Flexible DNN Parallelization via a Unified Abstraction
- [arxiv'23] ATP: Adaptive Tensor Parallelism for Foundation Models
- [IPDPS'23] MPipeMoE: Memory Efficient MoE for Pre-trained Models with Adaptive Pipeline Parallelism
- [CLUSTER'23] Prophet: Fine-grained Load Balancing for Parallel Training of Large-scale MoE Models
- [NeurIPS'23] ASPEN: Breaking Operator Barriers for Efficient Parallelization of Deep Neural Networks
- [NeurIPS'23] DeepPCR: Parallelizing Sequential Operations in Neural Networks
- [DAC'23] MixPipe: Efficient Bidirectional Pipeline Parallelism for Training Large-Scale Models
- [SC'23] Hanayo: Harnessing Wave-like Pipeline Parallelism for Enhanced Large Model Training Efficiency
- [SOSP'23] PIT: Optimization of Dynamic Sparse Deep Learning Models via Permutation Invariant Transformation
- [SOSP'23] Oobleck: Resilient Distributed Training of Large Models Using Pipeline Templates
- [MICRO'23] Grape: Practical and Efficient Graphed Execution for Dynamic Deep Neural Networks on GPUs
- [HPCA'23] Phloem: Automatic Acceleration of Irregular Applications with Fine-Grain Pipeline Parallelism
- [ACL'23] Sequence Parallelism: Long Sequence Training from System Perspective
- [CCGrid'23] A Deep Learning Pipeline Parallel Optimization Method
- [OSDI'23] MGG: Accelerating Graph Neural Networks with Fine-Grained Intra-Kernel Communication-Computation Pipelining on Multi-GPU Platforms
- [ATC'23] Accelerating Distributed MoE Training and Inference with Lina
- [ATC'23] SmartMoE: Efficiently Training Sparsely-Activated Models through Combining Offline and Online Parallelization
- [ATC'23] MSRL: Distributed Reinforcement Learning with Dataflow Fragments
- [Survey] [TPDS'23] A Survey on Auto-Parallelism of Large-Scale Deep Learning Training
- [ICML'23] SWARM Parallelism: Training Large Models Can Be Surprisingly Communication-Efficient
- [ICML'23] BPipe: Memory-Balanced Pipeline Parallelism for Training Large Language Models
- [ICS'23] A Hybrid Tensor-Expert-Data Parallelism Approach to Optimize Mixture-of-Experts Training
- [NSDI'23] TopoOpt: Co-optimizing Network Topology and Parallelization Strategy for Distributed Training Jobs
- [NSDI'23] Bamboo: Making Preemptible Instances Resilient for Affordable Training of Large DNNs
- [NSDI'23] ARK: GPU-driven Code Execution for Distributed Deep Learning
- [SIGMOD'23] FlexMoE: Scaling Large-scale Sparse Pre-trained Model Training via Dynamic Device Placement
- [MLSys'23] On Optimizing the Communication of Model Parallelism
- [MLSys'23] MegaBlocks: Efficient Sparse Training with Mixture-of-Experts
- [MLSys'23] Tutel: Adaptive Mixture-of-Experts at Scale
- [TPDS'23] Merak: An Efficient Distributed DNN Training Framework with Automated 3D Parallelism for Giant Foundation Models
- [PPoPP'23] Elastic Averaging for Efficient Pipelined DNN Training
- [PPoPP'23] Efficient All-Reduce for Distributed DNN Training in Optical Interconnect Systems
- [VLDB'23] MiCS: Near-linear Scaling for Training Gigantic Model on Public Cloud
- [VLDB'23] Galvatron: Efficient Transformer Training over Multiple GPUs Using Automatic Parallelism
- [ASPLOS'23] Mobius: Fine Tuning Large-Scale Models on Commodity GPU Servers
- [ASPLOS'23] Optimus-CC: Efficient Large NLP Model Training with 3D Parallelism Aware Communication Compression
- [arxiv'22] Colossal-AI: A Unified Deep Learning System For Large-Scale Parallel Training
- [arxiv'22] Using DeepSpeed and Megatron to Train Megatron-Turing NLG 530B, A Large-Scale Generative Language Model
- [ICPP'22] Tesseract: Parallelize the Tensor Parallelism Efficiently
- [NeurIPS'22] Fine-tuning Language Models over Slow Networks using Activation Quantization with Guarantees
- [SoCC'22] Accelerating Large-Scale Distributed Neural Network Training with SPMD Parallelism
- [MLSys'22] Pathways: Asynchronous distributed dataflow for ML
- [MLSys'22] SRIFTY: Swift and Thrifty Distributed Neural Network Training on the Cloud
- [MLSys'22] Efficient Strong Scaling Through Burst Parallel Training
- [EuroSys'22] Varuna: scalable, low-cost training of massive deep learning models
- [ATC'22] Whale: Efficient Giant Model Training over Heterogeneous GPUs
- [NeurIPS'22] AMP: Automatically Finding Model Parallel Strategies with Heterogeneity Awareness
- [PPoPP'22] FasterMoE: modeling and optimizing training of large-scale dynamic pre-trained models
- [ICML'22] DeepSpeed-MoE: Advancing Mixture-of-Experts Inference and Training to Power Next-Generation AI Scale
- [ICML'22] GLaM: Efficient Scaling of Language Models with Mixture-of-Experts
- [HPDC'22] Hare: Exploiting Inter-job and Intra-job Parallelism of Distributed Machine Learning on Heterogeneous GPUs
- [OSDI'22] Alpa: Automating Inter- and Intra-Operator Parallelism for Distributed Deep Learning
- [NSDI'22] Accelerating Collective Communication in Data Parallel Training across Deep Learning Frameworks
- [arxiv'21] Amazon SageMaker Model Parallelism: A General and Flexible Framework for Large Model Training
- [arxiv'21] GSPMD: General and Scalable Parallelization for ML Computation Graphs
- [JMLR'21] Switch Transformers: Scaling to Trillion Parameter Models with Simple and Efficient Sparsity
- [TPDS'21] TensorOpt: Exploring the Tradeoffs in Distributed DNN Training With Auto-Parallelism
- [ATC'21] Fine-tuning giant neural networks on commodity hardware with automatic pipeline model parallelism
- [SIGMOD'21] Heterogeneity-Aware Distributed Machine Learning Training via Partial Reduce [also in 2.8]
- [MLSys'21] PipeMare: Asynchronous Pipeline Parallel DNN Training
- [ICLR'21] GShard: Scaling Giant Models with Conditional Computation and Automatic Sharding
- [NeurIPS'21] Piper: Multidimensional Planner for DNN Parallelization
- [ICML'21] Memory-Efficient Pipeline-Parallel DNN Training
- [ICML'21] TeraPipe: Token-Level Pipeline Parallelism for Training Large-Scale Language Models
- [ICML'21] PipeTransformer: Automated Elastic Pipelining for Distributed Training of Large-scale Models
- [SC'21] Chimera: Efficiently Training Large-Scale Neural Networks with Bidirectional Pipelines
- [SC'21] Efficient Large-Scale Language Model Training on GPU Clusters Using Megatron-LM (PTD-P or Megatron-LM v2)
- [FAST'21] Behemoth: A Flash-centric Training Accelerator for Extreme-scale DNNs
- [PPoPP'21] DAPPLE: a pipelined data parallel approach for training large models
- [VLDB'21] Distributed Deep Learning on Data Systems: A Comparative Analysis of Approaches
- [HPCA'20] AccPar: Tensor Partitioning for Heterogeneous Deep Learning Accelerators
- [NeurIPS'20] Efficient Algorithms for Device Placement of DNN Graph Operators
- [arxiv'20] Megatron-LM: Training Multi-Billion Parameter Language Models Using Model Parallelism
- [KDD'20 Tutorial] DeepSpeed: System Optimizations Enable Training Deep Learning Models with Over 100 Billion Parameters
- [VLDB'20] PyTorch Distributed: Experiences on Accelerating Data Parallel Training
- [OSDI'20] A Unified Architecture for Accelerating Distributed DNN Training in Heterogeneous GPU/CPU Clusters (BytePS)
- [SOSP'19] PipeDream: Generalized Pipeline Parallelism for DNN Training
- [NeurIPS'20] Language Models are Few-Shot Learners [From OpenAI]
- [arxiv'20] Scaling Laws for Neural Language Models [From OpenAI]
- [HPCA'19] HyPar: Towards Hybrid Parallelism for Deep Learning Accelerator Array
- [IEEE MICRO'19] Optimizing Multi-GPU Parallelization Strategies for Deep Learning Training
- [MLSys'19] Beyond data and model parallelism for deep neural networks (FlexFlow)
- [MLSys'19] TicTac: Accelerating Distributed Deep Learning with Communication Scheduling
- [EuroSys'19] Parallax: Sparsity-aware Data Parallel Training of Deep Neural Networks
- [EuroSys'19] Supporting Very Large Models using Automatic Dataflow Graph Partitioning (Tofu)
- [SOSP'19] A Generic Communication Scheduler for Distributed DNN Training Acceleration
- [NeurIPS'18] Mesh-TensorFlow: Deep Learning for Supercomputers
- [NeurIPS'19] GPipe: Efficient Training of Giant Neural Networks using Pipeline Parallelism
- [ICML'18] Exploring Hidden Dimensions in Parallelizing Convolutional Neural Networks
- [Survey] [IJCAI'22] Survey on Efficient Training of Large Neural Networks
- [Survey] [ACM CSUR'19] Demystifying Parallel and Distributed Deep Learning
- [Survey] [ACM CSUR'19] Scalable Deep Learning on Distributed Infrastructures: Challenges, Techniques, and Tools
- [ATC'25] SAVE: Software-Implemented Fault Tolerance for Model Inference against GPU Memory Bit Flips
- [ATC'25] Universal Checkpointing: A Flexible and Efficient Distributed Checkpointing System for Large-Scale DNN Training with Reconfigurable Parallelism
- [arxiv'25] Adaptra: Straggler-Resilient Hybrid-Parallel Training with Pipeline Adaptation
- [arxiv'25] Nonuniform-Tensor-Parallelism: Mitigating GPU failure impact for Scaled-up LLM Training
- [arxiv'25] Characterizing GPU Resilience and Impact on AI/HPC Systems
- [NSDI'25] BCP: A Unified Checkpointing System for Large Foundation Model Development
- [NSDI'25] Minder: Faulty Machine Detection for Large-scale Distributed Model Training
- [EuroSys'25] SkyServe: Serving AI Models across Regions and Clouds with Spot Instances
- [ASPLOS'25] PCcheck: Persistent Concurrent Checkpointing for ML
- [arxiv'24] MoEtion: Efficient and Reliable Checkpointing for Mixture-of-Experts Models at Scale
- [arxiv'24] MoC-System: Efficient Fault Tolerance for Sparse Mixture-of-Experts Model Training
- [arxiv'24] TrainMover: Efficient ML Training Live Migration with No Memory Overhead
- [arxiv'24] Cloud Atlas: Efficient Fault Localization for Cloud Systems using Language Models and Causal Insight
- [arxiv'24] ByteCheckpoint: A Unified Checkpointing System for Large Foundation Model Development
- [arxiv'24] Universal Checkpointing: Efficient and Flexible Checkpointing for Large Scale Distributed Training
- [arxiv'24] Lazarus: Resilient and Elastic Training of Mixture-of-Experts Models with Adaptive Expert Placement
- [arxiv'24] PARALLELGPUOS: A Concurrent OS-level GPU Checkpoint and Restore System using Validated Speculation
- [SOSP'24] ReCycle: Resilient Training of Large DNNs using Pipeline Adaptation
- [HPDC'24] DataStates-LLM: Lazy Asynchronous Checkpointing for Large Language Models
- [EuroSys'24] Just-In-Time Checkpointing: Low Cost Error Recovery from Deep Learning Training Failures
- [NSDI'24] Parcae: Proactive, Liveput-Optimized DNN Training on Preemptible Instances
- [arxiv'23] Unicron: Economizing Self-Healing LLM Training at Scale
- [VLDB'23] Efficient Fault Tolerance for Recommendation Model Training via Erasure Coding
- [SOSP'23] GEMINI: Fast Failure Recovery in Distributed Training with In-Memory Checkpoints
- [SOSP'23] Oobleck: Resilient Distributed Training of Large Models Using Pipeline Templates
- [NSDI'23] Bamboo: Making Preemptible Instances Resilient for Affordable Training of Large DNNs
- [EuroSys'22] Varuna: scalable, low-cost training of massive deep learning models
- [ATC'22] Sibylla: To Retry or Not To Retry on Deep Learning Job Failure
- [MLSys'21] Understanding and Improving Failure Tolerant Training for Deep Learning Recommendation with Partial Recovery
- [FAST'21] CheckFreq: Frequent, Fine-Grained DNN Checkpointing
- [ICSE'20] An Empirical Study on Program Failures of Deep Learning Jobs
- [OSDI'23] Hydro: Surrogate-Based Hyperparameter Tuning Service in Datacenters
- [NSDI'23] ModelKeeper: Accelerating DNN Training via Automated Training Warmup
- [OSDI'20] Retiarii: A Deep Learning Exploratory-Training Framework
- [arxiv'25] TokenWeave: Efficient Compute-Communication Overlap for Distributed LLM Inference
- [arxiv'25] FLASH: Fast All-to-All Communication in GPU Clusters
- [arxiv'25] SDR-RDMA: Software-Defined Reliability Architecture for Planetary Scale RDMA Communication
- [arxiv'25] MCMComm: Hardware-Software Co-Optimization for End-to-End Communication in Multi-Chip-Modules
- [arxiv'25] GenTorrent: Scaling Large Language Model Serving with An Overlay Network
- [arxiv'25] Triton-distributed: Programming Overlapping Kernels on Distributed AI Systems with the Triton Compiler
- [arxiv'25] FlashOverlap: A Lightweight Design for Efficiently Overlapping Communication and Computation
- [arxiv'25] An Extensible Software Transport Layer for GPU Networking
- [HPCA'25] Enhancing Large-Scale AI Training Efficiency: The C4 Solution for Real-Time Anomaly Detection and Communication Optimization
- [arxiv'25] HeteroPod: XPU-Accelerated Infrastructure Offloading for Commodity Cloud-Native Applications
- [Survey] [arxiv'25] GPU-centric Communication Schemes for HPC and ML Applications
- [EuroMLSys'25] TAGC: Optimizing Gradient Communication in Distributed Transformer Training
- [arxiv'25] UB-Mesh: a Hierarchically Localized nD-FullMesh Datacenter Network Architecture
- [MLSys'25] TileLink: Generating Efficient Compute-Communication Overlapping Kernels using Tile-Centric Primitives
- [arxiv'25] Communication-Efficient Language Model Training Scales Reliably and Robustly: Scaling Laws for DiLoCo
- [NSDI'25] AutoCCL: Automated Collective Communication Tuning for Accelerating Distributed and Parallel DNN Training
- [NSDI'25] Efficient Direct-Connect Topologies for Collective Communications
- [arxiv'25] InfinitePOD: Building Datacenter-Scale High-Bandwidth Domain for LLM with Optical Circuit Switching Transceivers
- [IEEE MICRO'25] Understanding and Characterizing Communication Characteristics for Distributed Transformer Models
- [arxiv'25] In-Network Preprocessing of Recommender Systems on Multi-Tenant SmartNICs
- [arxiv'25] Scaling Large Language Model Training on Frontier with Low-Bandwidth Partitioning
- [arxiv'25] The Power of Negative Zero: Datatype Customization for Quantized Large Language Models
- [arxiv'25] mFabric: An Efficient and Scalable Fabric for Mixture-of-Experts Training
- [NSDI'25] OptiReduce: Resilient and Tail-Optimal AllReduce for Distributed Deep Learning in the Cloud
- [arxiv'24] TokenRing: An Efficient Parallelism Framework for Infinite-Context LLMs via Bidirectional Communication
- [arxiv'24] The Landscape of GPU-Centric Communication
- [arxiv'24] Revisiting the Time Cost Model of AllReduce
- [arxiv'24] LuWu: An End-to-End In-Network Out-of-Core Optimizer for 100B-Scale Model-in-Network Data-Parallel Training on Distributed GPUs
- [HotInfra'24] Immediate Communication for Distributed AI Tasks
- [NeurIPS'24] SDP4Bit: Toward 4-bit Communication Quantization in Sharded Data Parallelism for LLM Training
- [SC'24] Optimizing Distributed ML Communication with Fused Computation-Collective Operations
- [SC'24] Network-Offloaded Bandwidth-Optimal Broadcast and Allgather for Distributed AI
- [NeurIPS'24] LSH-MoE: Communication-efficient MoE Training via Locality-Sensitive Hashing
- [arxiv'24] LumosCore: Highly Scalable LLM Clusters with Optical Interconnect
- [TPDS'24] AutoDDL: Automatic Distributed Deep Learning With Near-Optimal Bandwidth Cost
- [HOTI'24] Unified Collective Communication (UCC): An Unified Library for CPU, GPU, and DPU Collectives
- [HOTI'24] Rail-only: A Low-Cost High-Performance Network for Training LLMs with Trillion Parameters
- [SC'24] Switch-Less Dragonfly on Wafers: A Scalable Interconnection Architecture based on Wafer-Scale Integration
- [HPDC'24] Near-Optimal Wafer-Scale Reduce
- [HPDC'24] Efficient all-to-all Collective Communication Schedules for Direct-connect Topologies
- [arxiv'24] HiCCL: A Hierarchical Collective Communication Library
- [ICS'24] gZCCL: Compression-Accelerated Collective Communication Framework for GPU Clusters
- [ICS'24] Snoopie: A Multi-GPU Communication Profiler and Visualizer
- [arxiv'24] CSPS: A Communication-Efficient Sequence-Parallelism based Serving System for Transformer based Models with Long Prompts
- [arxiv'24] Domino: Eliminating Communication in LLM Training via Generic Tensor Slicing and Overlapping
- [arxiv'24] Exploring GPU-to-GPU Communication: Insights into Supercomputer Interconnects
- [arxiv'24] Demystifying the Communication Characteristics for Distributed Transformer Models
- [ICPP'24] Sparse Gradient Communication with AlltoAll for Accelerating Distributed Deep Learning
- [NAIC @ SIGCOMM'24] Proof-of-Concept of a Flexible and High-Fidelity Approach to Distributed DNN Training Emulation
- [NAIC @ SIGCOMM'24] Eloquent: A More Robust Transmission Scheme for LLM Token Streaming
- [NAIC @ SIGCOMM'24] OmNICCL: Zero-cost Sparse AllReduce with Direct Cache Access and SmartNICs
- [HotNets'24] I've Got 99 Problems But FLOPS Ain't One
- [HotNets'24] MLTCP: A Distributed Technique to Approximate Centralized Flow Scheduling For Machine Learning
- [HotNets'22] Congestion Control in Machine Learning Clusters
- [SIGCOMM'24] Rethinking Machine Learning Collective Communication as a Multi-Commodity Flow Problem
- [SIGCOMM'24] RDMA over Ethernet for Distributed Training at Meta Scale
- [SIGCOMM'24] Accelerating Model Training in Multi-cluster Environments with Consumer-grade GPUs
- [SIGCOMM'24] MCCS: A Service-based Approach to Collective Communication for Multi-Tenant Cloud
- [SIGCOMM'24] Crux: GPU-Efficient Communication Scheduling for Deep Learning Training
- [arxiv'24] MLTCP: Congestion Control for DNN Training
- [arxiv'24] ForestColl: Efficient Collective Communications on Heterogeneous Network Fabrics
- [APNet'24] Understanding Communication Characteristics of Distributed Training
- [ICLR'24] ZeRO++: Extremely Efficient Collective Communication for Large Model Training
- [ICLR'24] CO2: Efficient Distributed Training with Full Communication-Computation Overlap
- [MLSys'24] L-GreCo: Layerwise-Adaptive Gradient Compression for Efficient and Accurate Deep Learning
- [MLSys'24] Lancet: Accelerating Mixture-of-Experts Training via Whole Graph Computation-Communication Overlapping
- [ASPLOS'24] T3: Transparent Tracking & Triggering for Fine-grained Overlap of Compute & Collectives
- [ASPLOS'24] TCCL: Discovering Better Communication Paths for PCIe GPU Clusters
- [ASPLOS'24] Centauri: Enabling Efficient Scheduling for Communication-Computation Overlap in Large Model Training via Communication Partitioning
- [ASPLOS'24] Two-Face: Combining Collective and One-Sided Communication for Efficient Distributed SpMM
- [NSDI'24] THC: Accelerating Distributed Deep Learning Using Tensor Homomorphic Compression
- [Survey] [arxiv'23] Communication-Efficient Distributed Deep Learning: A Comprehensive Survey
- [arxiv'23] Optimized Network Architectures for Large Language Model Training with Billions of Parameters
- [arxiv'23] FlexShard: Flexible Sharding for Industry-Scale Sequence Recommendation Models
- [arxiv'23] Rethinking Memory and Communication Cost for Efficient Large Language Model Training
- [arxiv'23] Zen: Near-Optimal Sparse Tensor Synchronization for Distributed DNN Training
- [arxiv'23] Optimized Network Architectures for Large Language Model Training with Billions of Parameters
- [arxiv'23] TACOS: Topology-Aware Collective Algorithm Synthesizer for Distributed Training
- [INFOCOM'23] Libra: Contention-Aware GPU Thread Allocation for Data Parallel Training in High Speed Networks
- [ICDCS'23] bbTopk: Bandwidth-Aware Sparse Allreduce with Blocked Sparsification for Efficient Distributed Training
- [ICML'23] CocktailSGD: Fine-tuning Foundation Models over 500Mbps Networks
  - Related to DT-FM (NeurIPS'22)
- [IPDPS'23] MCR-DL: Mix-and-Match Communication Runtime for Deep Learning
- [ASPLOS'23] MSCCLang: Microsoft Collective Communication Language
- [ASPLOS'23] Overlap Communication with Dependent Computation via Decomposition in Large Deep Learning Models
- [EuroSys'23] A2TP: Aggregator-aware In-network Aggregation for Multi-tenant Learning
- [MLSys'23] Cupcake: A Compression Optimizer for Scalable Communication-Efficient Distributed Training
- [MLSys'23] On Optimizing the Communication of Model Parallelism
- [NSDI'23] TopoOpt: Co-optimizing Network Topology and Parallelization Strategy for Distributed Training Jobs
- [NSDI'23] Better Together: Jointly Optimizing ML Collective Scheduling and Execution Planning using SYNDICATE
- [NSDI'23] TACCL: Guiding Collective Algorithm Synthesis using Communication Sketches
- [NSDI'23] ARK: GPU-driven Code Execution for Distributed Deep Learning
- [EuroSys'22] Out-of-order backprop: an effective scheduling technique for deep learning
- [ISCA'22] Themis: a network bandwidth-aware collective scheduling policy for distributed training of DL models
- [ISCA'22] Software-hardware co-design for fast and scalable training of deep learning recommendation models
- [SC'22] HammingMesh: A Network Topology for Large-Scale Deep Learning
- [PPoPP'22] Near-optimal sparse allreduce for distributed deep learning
- [MLSys'22] Synthesizing optimal parallelism placement and reduction strategies on hierarchical systems for deep learning (P^2)
- [ASPLOS'22] Breaking the Computation and Communication Abstraction Barrier in Distributed Machine Learning Workloads (CoCoNET)
- [EuroSys'21] DGCL: an efficient communication library for distributed GNN training
- [ICLR'21] Multi-Level Local SGD for Heterogeneous Hierarchical Networks
- [SIGMOD'21] Heterogeneity-Aware Distributed Machine Learning Training via Partial Reduce [also in 2.5]
- [SC'21] Flare: flexible in-network allreduce
- [NSDI'21] Scaling Distributed Machine Learning with In-Network Aggregation
- [ISCA'21] Enabling compute-communication overlap in distributed deep learning training platforms
- [PPoPP'21] Synthesizing optimal collective algorithms (SCCL)
- [SIGCOMM'21] SiP-ML: High-Bandwidth Optical Network Interconnects for Machine Learning Training
- [ISCA'20] An in-network architecture for accelerating shared-memory multiprocessor collectives
- [NeurIPS'20] Nimble: Lightweight and Parallel GPU Task Scheduling for Deep Learning
- [PPoPP'20] Taming unbalanced training workloads in deep learning with partial collective operations
- [MLSys'20] Blink: Fast and Generic Collectives for Distributed ML
- [MLSys'20] PLink: Discovering and Exploiting Datacenter Network Locality for Efficient Cloud-based Distributed Training
- [OSDI'20] A Unified Architecture for Accelerating Distributed DNN Training in Heterogeneous GPU/CPU Clusters (BytePS)
- [MLSys'19] Priority-based Parameter Propagation for Distributed DNN Training (P3)
- [MLSys'19] TicTac: Accelerating Distributed Deep Learning with Communication Scheduling
- [SOSP'19] A generic communication scheduler for distributed DNN training acceleration (ByteScheduler)
- [ATC'17] Poseidon: An Efficient Communication Architecture for Distributed Deep Learning on GPU Clusters
- [arxiv'25] TAH-QUANT: Effective Activation Quantization in Pipeline Parallelism over Slow Network
- [arxiv'25] DECA: A Near-Core LLM Decompression Accelerator Supporting Out-of-Order Invocation
- [arxiv'25] ITERA-LLM: Boosting Sub-8-Bit Large Language Model Inference via Iterative Tensor Decomposition
- [ISCA'25] Transitive Array: An Efficient GEMM Accelerator with Result Reuse
- [arxiv'24] Accelerating Distributed Deep Learning using Lossless Homomorphic Compression
- [ICML'24] Any-Precision LLM: Low-Cost Deployment of Multiple, Different-Sized LLMs
- [ACL'23] Distilling Step-by-Step! Outperforming Larger Language Models with Less Training Data and Smaller Model Sizes
- [ICLR'23] GPTQ: Accurate Post-Training Quantization for Generative Pre-trained Transformers
- [OSDI'23] AdaEmbed: Adaptive Embedding for Large-Scale Recommendation Models
- [EuroSys'23] Hi-Speed DNN Training with Espresso: Unleashing the Full Potential of Gradient Compression with Near-Optimal Usage Strategies
- [ICML'22] TSPipe: Learn from Teacher Faster with Pipelines
- [arxiv'25] TileLang: A Composable Tiled Programming Model for AI Systems
- [arxiv'25] Hexcute: A Tile-based Programming Language with Automatic Layout and Task-Mapping Synthesis
- [arxiv'25] DeepCompile: A Compiler-Driven Approach to Optimizing Distributed Deep Learning Training
- [arxiv'24] Mirage: A Multi-Level Superoptimizer for Tensor Programs
- [ASPLOS'25] Mosaic: Exploiting Instruction-Level Parallelism on Deep Learning Accelerators with iTex Tessellation
- [arxiv'25] Hercules: A Compiler for Productive Programming of Heterogeneous Systems
- [CC'25] LLM Compiler: Foundation Language Models for Compiler Optimization
- [CGO'25] IntelliGen: Instruction-Level Auto-tuning for Tensor Program with Monotonic Memory Optimization
- [SOSP'24] Scaling Deep Learning Computation over the Inter-core Connected Intelligence Processor with T10
- [OSDI'23] Cocktailer: Analyzing and Optimizing Dynamic Control Flow in Deep Learning
- [OSDI'23] Welder: Scheduling Deep Learning Memory Access via Tile-graph
- [OSDI'23] Effectively Scheduling Computational Graphs of Deep Neural Networks toward Their Domain-Specific Accelerators
- [OSDI'23] EINNET: Optimizing Tensor Programs with Derivation-Based Transformations
- [OSDI'23] Optimizing Dynamic Neural Networks with Brainstorm
- [OSDI'22] ROLLER: Fast and Efficient Tensor Compilation for Deep Learning
- [OSDI'20] Rammer: Enabling Holistic Deep Learning Compiler Optimizations with rTasks
- [OSDI'20] Ansor: Generating High-Performance Tensor Programs for Deep Learning
- [ASPLOS'20] FlexTensor: An Automatic Schedule Exploration and Optimization Framework for Tensor Computation on Heterogeneous System
- [OSDI'18] TVM: An Automated End-to-End Optimizing Compiler for Deep Learning
For a comprehensive list of GNN systems papers, refer to https://github.com/chwan1016/awesome-gnn-systems.
- [ICDE'25] CaliEX: A Disk-Based Large-Scale GNN Training System with Joint Design of Caching and Execution
- [arxiv'25] Plexus: Taming Billion-edge Graphs with 3D Parallel GNN Training
- [HPCA'25] Mithril: A Scalable System for Deep GNN Training
- [arxiv'25] Armada: Memory-Efficient Distributed Training of Large-Scale Graph Neural Networks
- [VLDB'25] NeutronTP: Load-Balanced Distributed Full-Graph GNN Training with Tensor Parallelism
- [arxiv'24] FastGL: A GPU-Efficient Framework for Accelerating Sampling-Based GNN Training at Large Scale
- [ICPP'24] GNNDrive: Reducing Memory Contention and I/O Congestion for Disk-based GNN Training
- [VLDB'24] NeutronStream: A Dynamic GNN Training Framework with Sliding Window for Graph Streams
- [arxiv'23] ReFresh: Reducing Memory Access from Exploiting Stable Historical Embeddings for Graph Neural Network Training
- [arxiv'23] Helios: An Efficient Out-of-core GNN Training System on Terabyte-scale Graphs with In-memory Performance
- [arxiv'23] GNNPipe: Accelerating Distributed Full-Graph GNN Training with Pipelined Model Parallelism
- [MLSys'23] Adaptive Message Quantization and Parallelization for Distributed Full-graph GNN Training
- [SIGMOD'23] DUCATI: A Dual-Cache Training System for Graph Neural Networks on Giant Graphs with the GPU
- [OSDI'23] MGG: Accelerating Graph Neural Networks with Fine-Grained Intra-Kernel Communication-Computation Pipelining on Multi-GPU Platforms
- [EuroSys'23] MariusGNN: Resource-Efficient Out-of-Core Training of Graph Neural Networks
- [KDD'22] Distributed Hybrid CPU and GPU training for Graph Neural Networks on Billion-Scale Heterogeneous Graphs
- [VLDB'22] TGL: a general framework for temporal GNN training on billion-scale graphs
- [OSDI'21] P3: Distributed Deep Graph Learning at Scale
- [ICLR'25] TidalDecode: Fast and Accurate LLM Decoding with Position Persistent Sparse Attention
- [arxiv'25] Cascadia: A Cascade Serving System for Large Language Models
- [arxiv'25] Efficient and Workload-Aware LLM Serving via Runtime Layer Swapping and KV Cache Resizing
- [arxiv'25] SkyLB: A Locality-Aware Cross-Region Load Balancer for LLM Inference
- [arxiv'25] EmbAdvisor: Adaptive Cache Management for Sustainable LLM Serving
- [arxiv'25] SCORPIO: Serving the Right Requests at the Right Time for Heterogeneous SLOs in LLM Inference
- [arxiv'25] Fast-dLLM: Training-free Acceleration of Diffusion LLM by Enabling KV Cache and Parallel Decoding
- [arxiv'25] HybridServe: Efficient Serving of Large AI Models with Confidence-Based Cascade Routing
- [arxiv'25] ServerlessLoRA: Minimizing Latency and Cost in Serverless Inference for LoRA-Based LLMs
- [arxiv'25] TokenWeave: Efficient Compute-Communication Overlap for Distributed LLM Inference
- [arxiv'25] Tilus: A Virtual Machine for Arbitrary Low-Precision GPGPU Computation in LLM Serving
- [OSDI'25] Clover: Exploiting Intra-device Parallelism for High Throughput Large Language Model Serving
- [arxiv'25] ServeGen: Workload Characterization and Generation of Large Language Model Serving in Production
- [arxiv'25] ELIS: Efficient LLM Iterative Scheduling System with Response Length Predictor
- [arxiv'25] Improving the Serving Performance of Multi-LoRA Large Language Models via Efficient LoRA and KV Cache Management
- [arxiv'25] Prism: Unleashing GPU Sharing for Cost-Efficient Multi-LLM Serving
- [arxiv'25] Tempo: Application-aware LLM Serving with Mixed SLO Requirements
- [arxiv'25] Ascendra: Dynamic Request Prioritization for Efficient LLM Serving
- [arxiv'25] Medha: Efficiently Serving Multi-Million Context Length LLM Inference Requests Without Approximations
- [arxiv'25] Streaming, Fast and Slow: Cognitive Load-Aware Streaming for Efficient LLM Serving
- [arxiv'25] Bullet: Boosting GPU Utilization for LLM Serving via Dynamic Spatial-Temporal Orchestration
- [Survey π] [arxiv'25] Taming the Titans: A Survey of Efficient LLM Inference Serving
- [MLSys'25] SOLA: Optimizing SLO Attainment for Large Language Model Serving with State-Aware Scheduling
- [arxiv'25] PARD: Accelerating LLM Inference with Low-Cost PARallel Draft Model Adaptation
- [arxiv'25] Circinus: Efficient Query Planner for Compound ML Serving
- [arxiv'25] HPU: High-Bandwidth Processing Unit for Scalable, Cost-effective LLM Inference via GPU Co-processing
- [Mobicom'25] D2MoE: Dual Routing and Dynamic Scheduling for Efficient On-Device MoE-based LLM Serving
- [arxiv'25] SeaLLM: Service-Aware and Latency-Optimized Resource Sharing for Large Language Model Inference
- [arxiv'25] gLLM: Global Balanced Pipeline Parallelism System for Distributed LLM Serving with Token Throttling
- [arxiv'25] Optimizing SLO-oriented LLM Serving with PD-Multiplexing
- [arxiv'25] SLO-Aware Scheduling for Large Language Model Inferences
- [arxiv'25] Cost-Efficient LLM Serving in the Cloud: VM Selection with KV Cache Offloading
- [ISPASS'25] Characterizing and Optimizing LLM Inference Workloads on CPU-GPU Coupled Architectures
- [arxiv'25] HELIOS: Adaptive Model And Early-Exit Selection for Efficient LLM Inference Serving
- [arxiv'25] DynaServe: Unified and Elastic Tandem-Style Execution for Dynamic Disaggregated LLM Serving
- [arxiv'25] Efficient LLM Serving on Hybrid Real-time and Best-effort Requests
- [arxiv'25] SLOs-Serve: Optimized Serving of Multi-SLO LLMs
- [arxiv'25] Understanding and Optimizing Multi-Stage AI Inference Pipelines
- [arxiv'24] Fast and Live Model Auto Scaling with O(1) Host Caching
- [SIGMOD'25] Apt-Serve: Adaptive Request Scheduling on Hybrid Cache for Scalable LLM Inference Serving
- [EuroMLSys'25] Performance Aware LLM Load Balancer for Mixed Workloads
- [MLSys'25] Rethinking Key-Value Cache Compression Techniques for Large Language Model Serving
- [arxiv'25] WaferLLM: A Wafer-Scale LLM Inference System
- [HPCA'25] PAISE: PIM-Accelerated Inference Scheduling Engine for Transformer-based LLM
- [HPCA'25] throttLL'eM: Predictive GPU Throttling for Energy Efficient LLM Inference Serving
- [arxiv'25] Niyama: Breaking the Silos of LLM Inference Serving
- [ASPLOS'25] Aqua: Network-Accelerated Memory Offloading for LLMs in Scale-Up GPU Domains
- [ASPLOS'25] Past-Future Scheduler for LLM Serving under SLA Guarantees
- [ASPLOS'25] Accelerating LLM Serving for Multi-turn Dialogues with Efficient Resource Management
- [EuroSys'25] SpInfer: Leveraging Low-Level Sparsity for Efficient Large Language Model Inference on GPUs
- [EuroSys'25] Multiplexing Dynamic Deep Learning Workloads with SLO-awareness in GPU Clusters
- [arxiv'25] Injecting Adrenaline into LLM Serving: Boosting Resource Utilization and Throughput via Attention Disaggregation
- [EuroSys'25] NeuStream: Bridging Deep Learning Serving and Stream Processing
- [arxiv'25] ModServe: Scalable and Resource-Efficient Large Multimodal Model Serving
- [arxiv'25] PipeBoost: Resilient Pipelined Architecture for Fast Serverless LLM Scaling
- [ISCA'25] Oaken: Fast and Efficient LLM Serving with Online-Offline Hybrid KV Cache Quantization
- [arxiv'25] Jenga: Effective Memory Management for Serving LLM with Heterogeneity
- [arxiv'25] AccelGen: Heterogeneous SLO-Guaranteed High-Throughput LLM Inference Serving for Diverse Applications
- [FAST'25] Mooncake: Trading More Storage for Less Computation - A KVCache-centric Architecture for Serving LLM Chatbot
- [arxiv'25] Collaborative Speculative Inference for Efficient LLM Inference Serving
- [NSDI'25] SuperServe: Fine-Grained Inference Serving for Unpredictable Workloads
- [arxiv'25] Seesaw: High-throughput LLM Inference via Model Re-sharding
- [arxiv'25] SpecServe: Efficient and SLO-Aware Large Language Model Serving with Adaptive Speculative Decoding
- [arxiv'25] ADOR: A Design Exploration Framework for LLM Serving with Enhanced Latency and Throughput
- [arxiv'25] Long-Context Inference with Retrieval-Augmented Speculative Decoding
- [WWW'25] External Large Foundation Model: How to Efficiently Serve Trillions of Parameters for Online Ads Recommendation
- [arxiv'25] Make LLM Inference Affordable to Everyone: Augmenting GPU Memory with NDP-DIMM
- [arxiv'25] KVLink: Accelerating Large Language Models via Efficient KV Cache Reuse
- [arxiv'25] Serving Models, Fast and Slow: Optimizing Heterogeneous LLM Inferencing Workloads at Scale
- [arxiv'25] LServe: Efficient Long-sequence LLM Serving with Unified Sparse Attention
- [arxiv'25] HeadInfer: Memory-Efficient LLM Inference by Head-wise Offloading
- [arxiv'25] Autellix: An Efficient Serving Engine for LLM Agents as General Programs
- [MLSys'25] ThunderServe: High-performance and Cost-efficient LLM Serving in Cloud Environments
- [ICLR'25] HexGen-2: Disaggregated Generative Inference of LLMs in Heterogeneous Environment
- [arxiv'25] Memory Offloading for Large Language Model Inference with Latency SLO Guarantees
- [EuroSys'25] SkyServe: Serving AI Models across Regions and Clouds with Spot Instances
- [ASPLOS'25] Helix: Serving Large Language Models over Heterogeneous GPUs and Network via Max-Flow
- [ASPLOS'25] Dilu: Enabling GPU Resourcing-on-Demand for Serverless DL Serving via Introspective Elasticity
- [arxiv'25] MPIC: Position-Independent Multimodal Context Caching System for Efficient MLLM Serving
- [arxiv'25] Demystifying Cost-Efficiency in LLM Serving over Heterogeneous GPUs
- [arxiv'25] Towards Efficient Large Multimodal Model Serving
- [arxiv'25] HeteroLLM: Accelerating Large Language Model Inference on Mobile SoCs platform with Heterogeneous AI Accelerators
- [arxiv'25] HyGen: Efficient LLM Serving via Elastic Online-Offline Request Co-location
- [arxiv'25] Locality-aware Fair Scheduling in LLM Serving
- [arxiv'25] DeltaZip: Efficient Serving of Multiple Full-Model-Tuned LLMs
- [arxiv'25] DeepFlow: Serverless Large Language Model Serving at Scale
- [arxiv'25] iServe: An Intent-based Serving System for LLMs
- [arxiv'25] AdaServe: SLO-Customized LLM Serving with Fine-Grained Speculative Decoding
- [arxiv'25] EchoLM: Accelerating LLM Serving with Real-time Knowledge Distillation
- [arxiv'25] OMEGA: A Low-Latency GNN Serving System for Large Graphs
- [arxiv'25] PRESERVE: Prefetching Model Weights and KV-Cache in Distributed LLM Serving
- [arxiv'25] Hierarchical Autoscaling for Large Language Model Serving with Chiron
- [arxiv'25] Mell: Memory-Efficient Large Language Model Serving via Multi-GPU KV Cache Management
- [arxiv'25] Accelerated Diffusion Models via Speculative Sampling
- [arxiv'25] FlashInfer: Efficient and Customizable Attention Engine for LLM Inference Serving
- [EuroSys'25 (to appear)] A House United Within Itself: SLO-Awareness for On-Premises Containerized ML Inference Clusters via Faro
- [arxiv'24] Efficiently Serving LLM Reasoning Programs with Certaindex
- [arxiv'24] LoL-PIM: Long-Context LLM Decoding with Scalable DRAM-PIM System
- [arxiv'24] TimelyLLM: Segmented LLM Serving System for Time-sensitive Robotic Applications
- [arxiv'24] Dovetail: A CPU/GPU Heterogeneous Speculative Decoding for LLM inference
- [arxiv'24] KunServe: Elastic and Efficient Large Language Model Serving with Parameter-centric Memory Management
- [arxiv'24] Tackling the Dynamicity in a Production LLM Serving System with SOTA Optimizations via Hybrid Prefill/Decode/Verify Scheduling on Efficient Meta-kernels
- [arxiv'24] SYMPHONY: Improving Memory Management for LLM Inference Workloads
- [arxiv'24] A System for Microserving of LLMs
- [arxiv'24] HashAttention: Semantic Sparsity for Faster Inference
- [arxiv'24] SpecExec: Massively Parallel Speculative Decoding for Interactive LLM Inference on Consumer Devices
- [arxiv'24] Unifying KV Cache Compression for Large Language Models with LeanKV
- [arxiv'24] Marconi: Prefix Caching for the Era of Hybrid LLMs
- [arxiv'24] PREBA: A Hardware/Software Co-Design for Multi-Instance GPU based AI Inference Servers
- [Survey π] [ACM CSUR'24] Resource-efficient Algorithms and Systems of Foundation Models: A Survey
- [arxiv'24] BlendServe: Optimizing Offline Inference for Auto-regressive Large Models with Resource-aware Batching
- [arxiv'24] SageAttention2: Efficient Attention with Thorough Outlier Smoothing and Per-thread INT4 Quantization [Code]
- [arxiv'24] SageAttention: Accurate 8-Bit Attention for Plug-and-play Inference Acceleration [Code]
- [arxiv'24] Optimizing Speculative Decoding for Serving Large Language Models Using Goodput
- [ACL'24] LayerSkip: Enabling Early Exit Inference and Self-Speculative Decoding
- [ACL'24] SwapMoE: Serving Off-the-shelf MoE-based Large Language Models with Tunable Memory Budget
- [arxiv'24] EcoServe: Maximizing Multi-Resource Utilization with SLO Guarantees in LLM Serving
- [IPDPS'24] Exploiting Inter-Layer Expert Affinity for Accelerating Mixture-of-Experts Model Inference
- [arxiv'24] EPS-MoE: Expert Pipeline Scheduler for Cost-Efficient MoE Inference
- [NeurIPS'24] Kangaroo: Lossless Self-Speculative Decoding for Accelerating LLMs via Double Early Exiting
- [NeurIPS'24] Toward Efficient Inference for Mixture of Experts
- [NeurIPS'24] Sequoia: Scalable and Robust Speculative Decoding
- [arxiv'24] Lynx: Enabling Efficient MoE Inference through Dynamic Batch-Aware Expert Selection
- [SC'24] PipeInfer: Accelerating LLM Inference using Asynchronous Pipelined Speculation
- [SC'24] SMIless: Serving DAG-based Inference with Dynamic Invocations under Serverless Computing
- [arxiv'24] SuffixDecoding: A Model-Free Approach to Speeding Up Large Language Model Inference
- [arxiv'24] V-LoRA: An Efficient and Flexible System Boosts Vision Applications with LoRA LMM
- [SenSys'24] LiteMoE: Customizing On-device LLM Serving via Proxy Submodel Tuning
- [arxiv'24] HOBBIT: A Mixed Precision Expert Offloading System for Fast MoE Inference
- [arxiv'24] NEO: Saving GPU Memory Crisis with CPU Offloading for Online LLM Inference
- [MICRO'24] Pushing the Performance Envelope of DNN-based Recommendation Systems Inference on GPUs
- [arxiv'24] VL-Cache: Sparsity and Modality-Aware KV Cache Compression for Vision-Language Model Inference Acceleration
- [arxiv'24] ShadowKV: KV Cache in Shadows for High-Throughput Long-Context LLM Inference
- [arxiv'24] Is the GPU Half-Empty or Half-Full? Practical Scheduling Techniques for LLMs
- [arxiv'24] POD-Attention: Unlocking Full Prefill-Decode Overlap for Faster LLM Inference
- [PML4LRS @ ICLR'24] Fiddler: CPU-GPU Orchestration for Fast Inference of Mixture-of-Experts Models
- [arxiv'24] MagicPIG: LSH Sampling for Efficient LLM Generation
- [arxiv'24] Revisiting SLO and Goodput Metrics in LLM Serving
- [arxiv'24] EPIC: Efficient Position-Independent Context Caching for Serving Large Language Models
- [arxiv'24] ParallelSpec: Parallel Drafter for Efficient Speculative Decoding
- [EuroSys'25] Fast State Restoration in LLM Serving with HCache
- [arxiv'24] SwiftKV: Fast Prefill-Optimized Inference with Knowledge-Preserving Model Transformation
- [arxiv'24] vAttention: Dynamic Memory Management for Serving LLMs without PagedAttention
- [arxiv'24] Mnemosyne: Parallelization Strategies for Efficiently Serving Multi-Million Context Length LLM Inference Requests Without Approximations
- [arxiv'24] CSPS: A Communication-Efficient Sequence-Parallelism based Serving System for Transformer based Models with Long Prompts
- [arxiv'24] DynamoLLM: Designing LLM Inference Clusters for Performance and Energy Efficiency
- [HPCA'24] KRISP: Enabling Kernel-wise RIght-sizing for Spatial Partitioned GPU Inference Servers
- [arxiv'24] Missile: Fine-Grained, Hardware-Level GPU Resource Isolation for Multi-Tenant DNN Inference
- [NeurIPS'24] Efficient LLM Scheduling by Learning to Rank
- [arxiv'24] P/D-Serve: Serving Disaggregated Large Language Model at Scale
- [arxiv'24] NanoFlow: Towards Optimal Large Language Model Serving Throughput
- [arxiv'24] MARLIN: Mixed-Precision Auto-Regressive Parallel Inference on Large Language Models
- [SOSP'24] PowerInfer: Fast Large Language Model Serving with a Consumer-grade GPU
- [SOSP'24] LoongServe: Efficiently Serving Long-Context Large Language Models with Elastic Sequence Parallelism
- [SOSP'24] Improving DNN Inference Throughput Using Practical, Per-Input Compute Adaptation
- [SOSP'24] Apparate: Rethinking Early Exits to Tame Latency-Throughput Tensions in ML Serving
- [arxiv'24] LLMServingSim: A HW/SW Co-Simulation Infrastructure for LLM Inference Serving at Scale
- [ICPP'24] GMM: An Efficient GPU Memory Management-based Model Serving System for Multiple DNN Inference Models
- [SIGCOMM'24] CacheGen: KV Cache Compression and Streaming for Fast Large Language Model Serving
- [ES-FoMO @ ICML'24] CO2: Precise Attention Score Observation for improving KV Cache Replacement in Large Language Models
- [OSDI'24] dLoRA: Dynamically Orchestrating Requests and Adapters for LoRA LLM Serving
- [OSDI'24] Parrot: Efficient Serving of LLM-based Applications with Semantic Variable
- [OSDI'24] USHER: Holistic Interference Avoidance for Resource Optimized ML Inference
- [OSDI'24] Fairness in Serving Large Language Models
- [OSDI'24] MonoNN: Enabling a New Monolithic Optimization Space for Neural Network Inference Tasks on Modern GPU-Centric Architectures
- [OSDI'24] Taming Throughput-Latency Tradeoff in LLM Inference with Sarathi-Serve
- [OSDI'24] ServerlessLLM: Low-Latency Serverless Inference for Large Language Models
- [OSDI'24] InfiniGen: Efficient Generative Inference of Large Language Models with Dynamic KV Cache Management
- [OSDI'24] Llumnix: Dynamic Scheduling for Large Language Model Serving
- [OSDI'24] DistServe: Disaggregating Prefill and Decoding for Goodput-optimized Large Language Model Serving
- [ATC'24] Power-aware Deep Learning Model Serving with μ-Serve
- [ATC'24] Fast Inference for Probabilistic Graphical Models
- [ATC'24] Cost-Efficient Large Language Model Serving for Multi-turn Conversations with CachedAttention
- [ATC'24] PUZZLE: Efficiently Aligning Large Language Models through Light-Weight Context Switch
- [ATC'24] Quant-LLM: Accelerating the Serving of Large Language Models via FP6-Centric Algorithm-System Co-Design on Modern GPUs
- [TPDS'24] ElasticBatch: A Learning-Augmented Elastic Scheduling System for Batch Inference on MIG
- [Survey π] [arxiv'24] LLM Inference Serving: Survey of Recent Advances and Opportunities
- [arxiv'24] Metron: Holistic Performance Evaluation Framework for LLM Inference Systems
- [arxiv'24] Compress then Serve: Serving Thousands of LoRA Adapters with Little Overhead
- [arxiv'24] One Queue Is All You Need: Resolving Head-of-Line Blocking in Large Language Model Serving
- [arxiv'24] MemServe: Context Caching for Disaggregated LLM Serving with Elastic Memory Pool
- [ISCA'24] ElasticRec: A Microservice-based Model Serving Architecture Enabling Elastic Resource Scaling for Recommendation Models
- [ISCA'24] Splitwise: Efficient generative LLM inference using phase splitting
- [ICML'24] Break the Sequential Dependency of LLM Inference Using Lookahead Decoding
- [ICML'24] Medusa: Simple LLM Inference Acceleration Framework with Multiple Decoding Heads
- [ICML'24] HexGen: Generative Inference of Large Language Model over Heterogeneous Environment
- [ICML'24] EAGLE: Speculative Sampling Requires Rethinking Feature Uncertainty
- [ICML'24] MuxServe: Flexible Spatial-Temporal Multiplexing for Multiple LLM Serving
- [HPCA'24] An LPDDR-based CXL-PNM Platform for TCO-efficient Inference of Transformer-based Large Language Models
- [MobiSys'24] ARISE: High-Capacity AR Offloading Inference Serving via Proactive Scheduling
- [MobiSys'24] Pantheon: Preemptible Multi-DNN Inference on Mobile Edge GPUs
- [arxiv'24] Hardware-Aware Parallel Prompt Decoding for Memory-Efficient Acceleration of LLM Inference
- [arxiv'24] HawkVision: Low-Latency Modeless Edge AI Serving
- [MLSys'24] HeteGen: Heterogeneous Parallel Inference for Large Language Models on Resource-Constrained Devices
- [MLSys'24] S-LoRA: Serving Thousands of Concurrent LoRA Adapters
- [MLSys'24] Vidur: A Large-Scale Simulation Framework For LLM Inference
- [arxiv'24] The CAP Principle for LLM Serving
- [WWW'24] λGrapher: A Resource-Efficient Serverless System for GNN Serving through Graph Sharing
- [ICML'24] CLLMs: Consistency Large Language Models
- [arxiv'24] BlockLLM: Multi-tenant Finer-grained Serving for Large Language Models
- [EuroSys'24] Model Selection for Latency-Critical Inference Serving
- [arxiv'24] Mélange: Cost Efficient Large Language Model Serving by Exploiting GPU Heterogeneity
- [arxiv'24] Learn To be Efficient: Build Structured Sparsity in Large Language Models
- [arxiv'24] Sponge: Inference Serving with Dynamic SLOs Using In-Place Vertical Scaling
- [ISCA'24] Pre-gated MoE: An Algorithm-System Co-Design for Fast and Scalable Mixture-of-Expert Inference
- [arxiv'24] Minions: Accelerating Large Language Model Inference with Adaptive and Collective Speculative Decoding
- [arxiv'24] ALTO: An Efficient Network Orchestrator for Compound AI Systems
- [ASPLOS'24] ExeGPT: Constraint-Aware Resource Scheduling for LLM Inference
- [ASPLOS'24] NeuPIMs: NPU-PIM Heterogeneous Acceleration for Batched LLM Inferencing
- [arxiv'24] ATP: Enabling Fast LLM Serving via Attention on Top Principal Keys
- [arxiv'24] Taming Throughput-Latency Tradeoff in LLM Inference with Sarathi-Serve
- [ICML'24] DéjàVu: KV-cache Streaming for Fast, Fault-tolerant Generative LLM Serving
- [ICLR'24] Model Tells You What to Discard: Adaptive KV Cache Compression for LLMs
- [arxiv'24] FlexLLM: A System for Co-Serving Large Language Model Inference and Parameter-Efficient Finetuning
- [arxiv'24] Wisdom of Committee: Distilling from Foundation Model to Specialized Application Model
- [arxiv'24] RelayAttention for Efficient Large Language Model Serving with Long System Prompts
- [arxiv'24] LLM-PQ: Serving LLM on Heterogeneous Clusters with Phase-Aware Partition and Adaptive Quantization
- [NSDI'24] Approximate Caching for Efficiently Serving Diffusion Models
- [arxiv'24] APIServe: Efficient API Support for Large-Language Model Inferencing
- [arxiv'24] ServerlessLLM: Locality-Enhanced Serverless Inference for Large Language Models
- [arxiv'24] MoE-Infinity: Activation-Aware Expert Offloading for Efficient MoE Serving
- [arxiv'24] FP6-LLM: Efficiently Serving Large Language Models Through FP6-Centric Algorithm-System Co-Design
- [arxiv'24] Accelerating Retrieval-Augmented Language Model Serving with Speculation
- [arxiv'24] CaraServe: CPU-Assisted and Rank-Aware LoRA Serving for Generative LLM Inference
- [arxiv'24] Inference without Interference: Disaggregate LLM Inference for Mixed Downstream Workloads
- [arxiv'24] DeepSpeed-FastGen: High-throughput Text Generation for LLMs via MII and DeepSpeed-Inference
- [Survey π] [arxiv'24] Unlocking Efficiency in Large Language Model Inference: A Comprehensive Survey of Speculative Decoding
- [arxiv'24] Learned Best-Effort LLM Serving
- [arxiv'24] Infinite-LLM: Efficient LLM Service for Long Context with DistAttention and Distributed KVCache
- [VLDB'24] Flash-LLM: Enabling Cost-Effective and Highly-Efficient Large Generative Model Inference with Unstructured Sparsity
- [ASPLOS'24] SpotServe: Serving Generative Large Language Models on Preemptible Instances
- [ASPLOS'24] SpecInfer: Accelerating Generative Large Language Model Serving with Speculative Inference and Token Tree Verification
- [arxiv'23] DeltaZip: Multi-Tenant Language Model Serving via Delta Compression
- [EMNLP'23] Fast and Robust Early-Exiting Framework for Autoregressive Language Models with Synchronized Parallel Decoding
- [arxiv'23] Draft & Verify: Lossless Large Language Model Acceleration via Self-Speculative Decoding
- [arxiv'23] Fairness in Serving Large Language Models
- [arxiv'23] Moirai: Towards Optimal Placement for Distributed Inference on Heterogeneous Devices
- [arxiv'23] Punica: Multi-Tenant LoRA Serving
- [arxiv'23] Pipeline Parallelism for DNN Inference with Practical Performance Guarantees
- [arxiv'23] SARATHI: Efficient LLM Inference by Piggybacking Decodes with Chunked Prefills
- [arxiv'23] High-throughput Generative Inference of Large Language Models with a Single GPU
- [NeurIPS'23] SpecTr: Fast Speculative Decoding via Optimal Transport
- [HPDC'23] Kairos: Building Cost-Efficient Machine Learning Inference Systems with Heterogeneous Cloud Resources
- [SOSP'23] Paella: Low-latency Model Serving with Virtualized GPU Scheduling
- [SOSP'23] Efficient Memory Management for Large Language Model Serving with PagedAttention
- [MLSys'23] Efficiently Scaling Transformer Inference
- [EuroSys'23] Fast and Efficient Model Serving Using Multi-GPUs with Direct-Host-Access
- [EuroSys'23] Tabi: An Efficient Multi-Level Inference System for Large Language Models
- [EuroSys'23] Pocket: ML Serving from the Edge
- [OSDI'23] AlpaServe: Statistical Multiplexing with Model Parallelism for Deep Learning Serving
- [NSDI'23] SHEPHERD: Serving DNNs in the Wild
- [VLDB'23] Serving and Optimizing Machine Learning Workflows on Heterogeneous Infrastructures
- [ICML'23] Fast Inference from Transformers via Speculative Decoding
- [SIGMOD'22] Serverless Data Science - Are We There Yet? A Case Study of Model Serving
- [OSDI'22] Orca: A Distributed Serving System for Transformer-Based Generative Models
- [OSDI'22] Microsecond-scale Preemption for Concurrent GPU-accelerated DNN Inferences
- [ATC'22] SOTER: Guarding Black-box Inference for General Neural Networks at the Edge
- [ATC'22] Serving Heterogeneous Machine Learning Models on Multi-GPU Servers with Spatio-Temporal Sharing
- [ATC'22] Tetris: Memory-efficient Serverless Inference through Tensor Sharing
- [ATC'22] PetS: A Unified Framework for Parameter-Efficient Transformers Serving
- [ATC'21] INFaaS: Automated Model-less Inference Serving
- [SoCC'21] Morphling: Fast, Near-Optimal Auto-Configuration for Cloud-Native Model Serving
- [arxiv'21] Supporting Massive DLRM Inference through Software Defined Memory
- [MobiCom'20] SPINN: Synergistic Progressive Inference of Neural Networks over Device and Cloud
This is the list of papers on MoE training and inference (cross-listed from Sections 2.6 and 3).
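Most of the entries below optimize the same core computation: a sparsely-gated MoE layer in which a learned router activates only the top-k experts per token. The sketch below (plain PyTorch; layer sizes, names, and the naive per-expert dispatch loop are illustrative assumptions, not taken from any listed paper) shows that routing pattern in its simplest form.

```python
# Minimal top-k gated MoE layer (illustrative sketch only).
# A router scores experts per token, the top-k experts run, and their
# outputs are mixed with the renormalized gate weights.
import torch
import torch.nn as nn
import torch.nn.functional as F

class TopKMoE(nn.Module):
    def __init__(self, d_model=64, d_ff=256, num_experts=8, k=2):
        super().__init__()
        self.k = k
        self.router = nn.Linear(d_model, num_experts)        # gating network
        self.experts = nn.ModuleList([
            nn.Sequential(nn.Linear(d_model, d_ff), nn.GELU(), nn.Linear(d_ff, d_model))
            for _ in range(num_experts)
        ])

    def forward(self, x):                                     # x: [tokens, d_model]
        scores = self.router(x)                               # [tokens, num_experts]
        topk_val, topk_idx = scores.topk(self.k, dim=-1)      # each token picks k experts
        gates = F.softmax(topk_val, dim=-1)                   # renormalize over chosen experts
        out = torch.zeros_like(x)
        for slot in range(self.k):                            # naive dispatch; real systems
            for e, expert in enumerate(self.experts):         # batch tokens per expert instead
                mask = topk_idx[:, slot] == e
                if mask.any():
                    out[mask] += gates[mask, slot].unsqueeze(-1) * expert(x[mask])
        return out

if __name__ == "__main__":
    tokens = torch.randn(16, 64)
    print(TopKMoE()(tokens).shape)   # torch.Size([16, 64])
```

In practice the inner loops are replaced by expert parallelism with all-to-all token exchange, fused or batched expert kernels, load-balancing losses, and expert offloading or caching, which is what most of the papers listed below target.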
- [arxiv'25] FlashDMoE: Fast Distributed MoE in a Single Kernel
- [arxiv'25] EvoMoE: Expert Evolution in Mixture of Experts for Multimodal Large Language Models
- [arxiv'25] CoMoE: Contrastive Representation for Mixture-of-Experts in Parameter-Efficient Fine-tuning
- [arxiv'25] PreMoe: Lightening MoEs on Constrained Memory by Expert Pruning and Retrieval
- [arxiv'25] Not All Models Suit Expert Offloading: On Local Routing Consistency of Mixture-of-Expert Models
- [arxiv'25] MegaScale-MoE: Large-Scale Communication-Efficient Training of Mixture-of-Experts Models in Production
- [ATC'25] PopFetcher: Towards Accelerated Mixture-of-Experts Training Via Popularity Based Expert-Wise Prefetch
- [arxiv'25] Toward Cost-Efficient Serving of Mixture-of-Experts with Asynchrony
- [ICML'25] FloE: On-the-Fly MoE Inference on Memory-constrained GPU
- [arxiv'25] PT-MoE: An Efficient Finetuning Framework for Integrating Mixture-of-Experts into Prompt Tuning
- [arxiv'25] MoEQuant: Enhancing Quantization for Mixture-of-Experts Large Language Models via Expert-Balanced Sampling and Affinity Guidance
- [arxiv'25] Faster MoE LLM Inference for Extremely Large Models
- [arxiv'25] Accelerating Mixture-of-Experts Training with Adaptive Expert Replication
- [NAACL'25] MoLA: MoE LoRA with Layer-wise Expert Allocation
- [NAACL'25] Marrying LLMs with Dynamic Forecasting: A Graph Mixture-of-expert Perspective
- [NAACL'25] Sparser Mixture-of-Adapters with Cross-Layer Generalization
- [NAACL'25] SimSMoE: Toward Efficient Training Mixture of Experts via Solving Representational Collapse
- [Mobicom'25] D2MoE: Dual Routing and Dynamic Scheduling for Efficient On-Device MoE-based LLM Serving
- [arxiv'25] MoE Parallel Folding: Heterogeneous Parallelism Mappings for Efficient Large-Scale MoE Model Training with Megatron Core
- [arxiv'25] MoE-Gen: High-Throughput MoE Inference on a Single GPU with Module-Based Batching
- [arxiv'25] Parameters vs FLOPs: Scaling Laws for Optimal Sparsity for Mixture-of-Experts Language Models
- [arxiv'25] Unveiling Hidden Collaboration within Mixture-of-Experts in Large Language Models
- [arxiv'25] Dense Backpropagation Improves Training for Sparse Mixture-of-Experts
- [arxiv'25] MoE-Lens: Towards the Hardware Limit of High-Throughput MoE LLM Serving Under Resource Constraints
- [arxiv'25] C3PO: Critical-Layer, Core-Expert, Collaborative Pathway Optimization for Test-Time Expert Re-Mixing
- [arxiv'25] Cluster-Driven Expert Pruning for Mixture-of-Experts Large Language Models
- [arxiv'25] S'MoRE: Structural Mixture of Residual Experts for LLM Fine-tuning
- [DAC'25] HybriMoE: Hybrid CPU-GPU Scheduling and Cache Management for Efficient MoE Inference
- [arxiv'25] Finding Fantastic Experts in MoEs: A Unified Study for Expert Dropping Strategies and Observations
- [arxiv'25] HeterMoE: Efficient Training of Mixture-of-Experts Models on Heterogeneous GPUs
- [arxiv'25] MegaScale-Infer: Serving Mixture-of-Experts at Scale with Disaggregated Expert Parallelism
- [Survey π] [TKDE'25] A Survey on Mixture of Experts
- [ICLR'25] NetMoE: Accelerating MoE Training through Dynamic Sample Placement
- [arxiv'25] ProMoE: Fast MoE-based LLM Serving using Proactive Caching
- [arxiv'25] Mixture of Lookup Experts
- [EuroSys'25] Samoyeds: Accelerating MoE Models with Structured Sparsity Leveraging Sparse Tensor Cores
- [EuroMLSys'25] Priority-Aware Preemptive Scheduling for Mixed-Priority Workloads in MoE Inference
- [EuroMLSys'25] Accelerating MoE Model Inference with Expert Sharding
- [arxiv'25] eMoE: Task-aware Memory Efficient Mixture-of-Experts-Based (MoE) Model Inference
- [KDD'25] ResMoE: Space-efficient Compression of Mixture of Experts LLMs via Residual Restoration
- [arxiv'25] Continual Pre-training of MoEs: How robust is your router?
- [arxiv'25] Every FLOP Counts: Scaling a 300B Mixture-of-Experts LING LLM without Premium GPUs
- [arxiv'25] Capacity-Aware Inference: Mitigating the Straggler Effect in Mixture of Experts
- [arxiv'25] Speculative MoE: Communication Efficient Parallel MoE Inference with Speculative Token and Expert Pre-scheduling
- [MLSys'25] Comet: Fine-grained Computation-communication Overlapping for Mixture-of-Experts
- [arxiv'25] CoSMoEs: Compact Sparse Mixture of Experts
- [CVPR'25] DeRS: Towards Extremely Efficient Upcycled Mixture-of-Experts Models
- [ASPLOS'25] CoServe: Efficient Collaboration-of-Experts (CoE) Model Inference with Limited Memory
- [arxiv'25] Comet: Fine-grained Computation-communication Overlapping for Mixture-of-Experts
- [arxiv'25] BigMac: A Communication-Efficient Mixture-of-Experts Model Structure for Fast Training and Inference
- [arxiv'25] DSMoE: Matrix-Partitioned Experts with Dynamic Routing for Computation-Efficient Dense LLMs
- [arxiv'25] Every Expert Matters: Towards Effective Knowledge Distillation for Mixture-of-Experts Language Models
- [arxiv'25] MoETuner: Optimized Mixture of Expert Serving with Balanced Expert Placement and Token Routing
- [arxiv'25] Joint MoE Scaling Laws: Mixture of Experts Can Be Memory Efficient
- [arxiv'25] Fair-MoE: Fairness-Oriented Mixture of Experts in Vision-Language Models
- [arxiv'25] fMoE: Fine-Grained Expert Offloading for Large Mixture-of-Experts Serving
- [TPDS'25] EfficientMoE: Optimizing Mixture-of-Experts Model Training with Adaptive Load Balance
- [arxiv'25] Hecate: Unlocking Efficient Sparse Model Training via Fully Sharded Sparse Data Parallelism
- [NAACL'25] MergeME: Model Merging Techniques for Homogeneous and Heterogeneous MoEs
- [arxiv'25] BTS: Harmonizing Specialized Experts into a Generalist LLM
- [ASPLOS'25] FSMoE: A Flexible and Scalable Training System for Sparse Mixture-of-Experts Models
- [arxiv'25] Demons in the Detail: On Implementing Load Balancing Loss for Training Specialized Mixture-of-Expert Models
- [arxiv'25] Optimizing Distributed Deployment of Mixture-of-Experts Model Inference in Serverless Computing
- [MICRO'24] SambaNova SN40L: Scaling the AI Memory Wall with Dataflow and Composition of Experts
- [TPDS'24] MPMoE: Memory Efficient MoE for Pre-Trained Models With Adaptive Pipeline Parallelism
- Journal version of [IPDPS'23] MPipeMoE: Memory Efficient MoE for Pre-trained Models with Adaptive Pipeline Parallelism
- [arxiv'24] DeepSeek-V3 Technical Report
- [arxiv'24] HEXA-MoE: Efficient and Heterogeneous-aware MoE Acceleration with ZERO Computation Redundancy
- [arxiv'24] Communication-Efficient Sparsely-Activated Model Training via Sequence Migration and Token Condensation
- [arxiv'24] Nexus: Specialization meets Adaptability for Efficiently Training Mixture of Experts
- [arxiv'24] ReMoE: Fully Differentiable Mixture-of-Experts with ReLU Routing
- [Survey π] [arxiv'24] A Survey on Inference Optimization Techniques for Mixture of Experts Models
- [arxiv'24] DeepSeek-VL2: Mixture-of-Experts Vision-Language Models for Advanced Multimodal Understanding
- [arxiv'24] Llama 3 Meets MoE: Efficient Upcycling
- [arxiv'24] Sparsing Law: Towards Large Language Models with Greater Activation Sparsity
- [arxiv'24] Mixture of A Million Experts
- [arxiv'24] MoE-CAP: Cost-Accuracy-Performance Benchmarking for Mixture-of-Experts Systems
- [arxiv'24] MoESys: A Distributed and Efficient Mixture-of-Experts Training and Inference System for Internet Services
- [arxiv'24] Toward Inference-optimal Mixture-of-Expert Large Language Models
- [arxiv'24] Expert-Token Resonance: Redefining MoE Routing through Affinity-Driven Active Selection
- [MLArchSys'24 @ ISCA'24] MoE-ERAS: Expert Residency Aware Selection
- [arxiv'24] MoE Jetpack: From Dense Checkpoints to Adaptive Mixture of Experts for Vision Tasks
- [arxiv'24] Prediction Is All MoE Needs: Expert Load Distribution Goes from Fluctuating to Stabilizing
- [arxiv'24] DeepSeek-V2: A Strong, Economical, and Efficient Mixture-of-Experts Language Model
- [COLM'24] Lory: Fully Differentiable Mixture-of-Experts for Autoregressive Language Model Pre-training
- [ME-FoMo @ ICLR'24] Scaling Laws for Fine-Grained Mixture of Experts
- [arxiv'24] UOE: Unlearning One Expert Is Enough For Mixture-of-experts LLMS
- [ML for Sys workshop @ NeurIPS'24] IFMoE: An Inference Framework Design for Fine-grained MoE
- [ML for Sys workshop @ NeurIPS'24] TurboMoE: Enhancing MoE Model Training with Smart Kernel-Fusion and Data Transformation
- [arxiv'24] Dense Backpropagation Improves Routing for Sparsely-Gated Mixture-of-Experts
- [arxiv'24] MoE-Lightning: High-Throughput MoE Inference on Memory-constrained GPUs
- [arxiv'24] Pro-Prophet: Systematic Load Balancing Method for Efficient Parallel Training of Large-scale MoE Models
- [EMNLP'24] MiLoRA: Efficient Mixture of Low-Rank Adaptation for Large Language Models Fine-tuning
- [EMNLP'24] Mixture of Diverse Size Experts
- [EMNLP'24] AdaMOE: Token-Adaptive Routing with Null Experts for Mixture-of-Experts Language Models
- [ACL'24] Not All Experts are Equal: Efficient Expert Pruning and Skipping for Mixture-of-Experts Large Language Models
- [ACL'24] SwapMoE: Serving Off-the-shelf MoE-based Large Language Models with Tunable Memory Budget
- [SoCC'24] MoEsaic: Shared Mixture of Experts
- [KDD'24] Efficient Mixture of Experts based on Large Language Models for Low-Resource Data Preprocessing
- [arxiv'24] Pipeline MoE: A Flexible MoE Implementation with Pipeline Parallelism
- [IPDPS'24] Exploiting Inter-Layer Expert Affinity for Accelerating Mixture-of-Experts Model Inference
- [arxiv'24] EPS-MoE: Expert Pipeline Scheduler for Cost-Efficient MoE Inference
- [arxiv'24] Shortcut-connected Expert Parallelism for Accelerating Mixture of Experts
- [NeurIPS'24] Toward Efficient Inference for Mixture of Experts
- [arxiv'24] Lynx: Enabling Efficient MoE Inference through Dynamic Batch-Aware Expert Selection
- [MLSys'24] SiDA: Sparsity-Inspired Data-Aware Serving for Efficient and Scalable Large Mixture-of-Experts Models
- [SC'24] APTMoE: Affinity-Aware Pipeline Tuning for MoE Models on Bandwidth-Constrained GPU Nodes
- [NeurIPS'24] GraphMETRO: Mitigating Complex Graph Distribution Shifts via Mixture of Aligned Experts
- [arxiv'24] HOBBIT: A Mixed Precision Expert Offloading System for Fast MoE Inference
- [arxiv'24] Hunyuan-Large: An Open-Source MoE Model with 52 Billion Activated Parameters by Tencent
- [NeurIPS'24] LSH-MoE: Communication-efficient MoE Training via Locality-Sensitive Hashing
- [arxiv'24] Uni-MoE: Scaling Unified Multimodal LLMs with Mixture of Experts
- [NeurIPS'24] Read-ME: Refactorizing LLMs as Router-Decoupled Mixture of Experts with System Co-Design
- [arxiv'24] ExpertFlow: Optimized Expert Activation and Token Allocation for Efficient Mixture-of-Experts Inference
- [arxiv'24] Demystifying the Compression of Mixture-of-Experts Through a Unified Framework
- [PML4LRS @ ICLR'24] Fiddler: CPU-GPU Orchestration for Fast Inference of Mixture-of-Experts Models
- [arxiv'24] Optimizing Mixture-of-Experts Inference Time Combining Model Deployment and Communication Scheduling
- [arxiv'24] MoE-Pruner: Pruning Mixture-of-Experts Large Language Model using the Hints from Its Router
- [arxiv'24] Duo-LLM: A Framework for Studying Adaptive Computation in Large Language Models
- [arxiv'24] MoH: Multi-Head Attention as Mixture-of-Head Attention
- [arxiv'24] AT-MoE: Adaptive Task-planning Mixture of Experts via LoRA Approach
- [NeurIPS'24 (Spotlight)] Flex-MoE: Modeling Arbitrary Modality Combination via the Flexible Mixture-of-Experts
- [arxiv'24] Aria: An Open Multimodal Native Mixture-of-Experts Model
- [arxiv'24] MC-MoE: Mixture Compressor for Mixture-of-Experts LLMs Gains More
- [arxiv'24] MoE++: Accelerating Mixture-of-Experts Methods with Zero-Computation Experts
- [arxiv'24] Upcycling Large Language Models into Mixture of Experts
- [arxiv'24] No Need to Talk: Asynchronous Mixture of Language Models
- [arxiv'24] Lazarus: Resilient and Elastic Training of Mixture-of-Experts Models with Adaptive Expert Placement
- [arxiv'24] HMoE: Heterogeneous Mixture of Experts for Language Modeling
- [arxiv'24] FedMoE: Personalized Federated Learning via Heterogeneous Mixture of Experts
- [arxiv'24] AquilaMoE: Efficient Training for MoE Models with Scale-Up and Scale-Out Strategies
- [arxiv'24] Layerwise Recurrent Router for Mixture-of-Experts
- [arxiv'24] Partial Experts Checkpoint: Efficient Fault Tolerance for Sparse Mixture-of-Experts Model Training
- [SRW @ ACL'24] MoExtend: Tuning New Experts for Modality and Task Extension
- [arxiv'24] MoDE: Effective Multi-task Parameter Efficient Fine-Tuning with a Mixture of Dyadic Experts
- [arxiv'24] Efficient Expert Pruning for Sparse Mixture-of-Experts Language Models: Enhancing Performance and Reducing Inference Costs
- [arxiv'24] Self-MoE: Towards Compositional Large Language Models with Self-Specialized Experts
- [arxiv'24] Skywork-MoE: A Deep Dive into Training Techniques for Mixture-of-Experts Language Models
- [ICML'24] Scaling Laws for Fine-Grained Mixture of Experts
- [ICML'24] Scaling Beyond the GPU Memory Limit for Large Mixture-of-Experts Model Training
- [MLSys'24] QMoE: Sub-1-Bit Compression of Trillion-Parameter Models
- [MLSys'24] Lancet: Accelerating Mixture-of-Experts Training via Whole Graph Computation-Communication Overlapping
- [arxiv'24] CuMo: Scaling Multimodal LLM with Co-Upcycled Mixture-of-Experts
- [arxiv'24] AdaMoLE: Fine-Tuning Large Language Models with Adaptive Mixture of Low-Rank Adaptation Experts
- [SIGIR'24] M3oE: Multi-Domain Multi-Task Mixture-of-Experts Recommendation Framework
- [EuroSys'24] ScheMoE: An Extensible Mixture-of-Experts Distributed Training System with Tasks Scheduling
- [arxiv'24] MixLoRA: Enhancing Large Language Models Fine-Tuning with LoRA based Mixture of Experts
- [ICLR'24] Mixture of LoRA Experts
- [arxiv'24] Branch-Train-MiX: Mixing Expert LLMs into a Mixture-of-Experts LLM
- [arxiv'24] MoE-Infinity: Activation-Aware Expert Offloading for Efficient MoE Serving
- [IJCAI'24] LocMoE: A Low-overhead MoE for Large Language Model Training
- [ISCA'24] Pre-gated MoE: An Algorithm-System Co-Design for Fast and Scalable Mixture-of-Expert Inference
- [IPDPS'23] MPipeMoE: Memory Efficient MoE for Pre-trained Models with Adaptive Pipeline Parallelism
- [EMNLP'23] Adaptive Gating in Mixture-of-Experts based Language Models
- [ACL'23] AutoMoE: Heterogeneous Mixture-of-Experts with Adaptive Computation for Efficient Neural Machine Translation
- [ICLR'23] Sparse Upcycling: Training Mixture-of-Experts from Dense Checkpoints
- [ICML'23] Brainformers: Trading Simplicity for Efficiency
- [arxiv'23] Towards MoE Deployment: Mitigating Inefficiencies in Mixture-of-Expert (MoE) Inference
- [arxiv'23] Fast Inference of Mixture-of-Experts Language Models with Offloading
- [ATC'23] Accelerating Distributed MoE Training and Inference with Lina
- [ATC'23] SmartMoE: Efficiently Training Sparsely-Activated Models through Combining Offline and Online Parallelization
- [OSDI'23] Optimizing Dynamic Neural Networks with Brainstorm
- [SIGMOD'23] FlexMoE: Scaling Large-scale Sparse Pre-trained Model Training via Dynamic Device Placement
- [ICS'23] A Hybrid Tensor-Expert-Data Parallelism Approach to Optimize Mixture-of-Experts Training
- [MLSys'23] MegaBlocks: Efficient Sparse Training with Mixture-of-Experts
- [MLSys'23] Tutel: Adaptive Mixture-of-Experts at Scale
- [arxiv'22] ST-MoE: Designing Stable and Transferable Sparse Expert Models
- [PPoPP'22] FasterMoE: modeling and optimizing training of large-scale dynamic pre-trained models
- [SustaiNLP @ EMNLP'22] Who Says Elephants Can't Run: Bringing Large Scale MoE Models into Cloud Scale Production
- [NeurIPS'22] Mixture-of-Experts with Expert Choice Routing
- [ICML'22] DeepSpeed-MoE: Advancing Mixture-of-Experts Inference and Training to Power Next-Generation AI Scale
- [ICML'22] GLaM: Efficient Scaling of Language Models with Mixture-of-Experts
- [JMLR'22] Switch transformers: scaling to trillion parameter models with simple and efficient sparsity
- [EMNLP'21] Beyond Distillation: Task-level Mixture-of-Experts for Efficient Inference
- [ICLR'17] Outrageously Large Neural Networks: The Sparsely-Gated Mixture-of-Experts Layer
- [arxiv'25] SALE: Low-bit Estimation for Efficient Sparse Attention in Long-context LLM Prefilling
- [arxiv'25] Training Long-Context LLMs Efficiently via Chunk-wise Optimization
- [arxiv'25] SlimPipe: Memory-Thrifty and Efficient Pipeline Parallelism for Long-Context LLM Training
- [ASPLOS'25] FlexSP: Accelerating Large Language Model Training via Flexible Sequence Parallelism
- [arxiv'25] XAttention: Block Sparse Attention with Antidiagonal Scoring
- [arxiv'25] SPPO: Efficient Long-sequence LLM Training via Adaptive Sequence Pipeline Parallel Offloading
- [arxiv'25] ByteScale: Efficient Scaling of LLM Training with a 2048K Context Length on More Than 12,000 GPUs
- [arxiv'25] Long-Context Inference with Retrieval-Augmented Speculative Decoding
- [PODC'25] System Optimizations for Enabling Training of Extreme Long Sequence Transformer Models
- [arxiv'25] ParallelComp: Parallel Long-Context Compressor for Length Extrapolation
- [arxiv'25] LServe: Efficient Long-sequence LLM Serving with Unified Sparse Attention
- [arxiv'25] MoBA: Mixture of Block Attention for Long-Context LLMs
- [arxiv'25] Tactic: Adaptive Sparse Attention with Clustering and Distribution Fitting for Long-Context LLMs
- [arxiv'25] APB: Accelerating Distributed Long-Context Inference by Passing Compressed Context Blocks across GPUs
- [SIGMOD'25] MEMO: Fine-grained Tensor Management For Ultra-long Context LLM Training
- [arxiv'25] Twilight: Adaptive Attention Sparsity with Hierarchical Top-p Pruning
- [arxiv'25] Adjoint sharding for very long context training of state space models
- [arxiv'24] LoL-PIM: Long-Context LLM Decoding with Scalable DRAM-PIM System
- [arxiv'24] Data-Centric and Heterogeneity-Adaptive Sequence Parallelism for Efficient LLM Training
- [SOSP'24] LoongServe: Efficiently Serving Long-Context Large Language Models with Elastic Sequence Parallelism
- [arxiv'24] USP: A Unified Sequence Parallelism Approach for Long Context Generative AI
- [arxiv'24] Training Ultra Long Context Language Model with Fully Pipelined Distributed Transformer
- [NeurIPS'24 Workshop] Long Context RAG Performance of Large Language Models
- [arxiv'24] ShadowKV: KV Cache in Shadows for High-Throughput Long-Context LLM Inference
- [arxiv'24] Mnemosyne: Parallelization Strategies for Efficiently Serving Multi-Million Context Length LLM Inference Requests Without Approximations
- [arxiv'24] CSPS: A Communication-Efficient Sequence-Parallelism based Serving System for Transformer based Models with Long Prompts
- [COLM'24] TriForce: Lossless Acceleration of Long Sequence Generation with Hierarchical Speculative Decoding
- [arxiv'24] FocusLLM: Scaling LLM's Context by Parallel Decoding
- [Survey π] [IJCAI'24] X-former Elucidator: Reviving Efficient Attention for Long Context Language Modeling
- [arxiv'24] FedMoE: Personalized Federated Learning via Heterogeneous Mixture of Experts
- [MLSys'24] LIFL: A Lightweight, Event-driven Serverless Platform for Federated Learning
- [arxiv'24] FedEx: Expediting Federated Learning over Heterogeneous Mobile Devices by Overlapping and Participant Selection
- [KDD'24] FedBiOT: LLM Local Fine-tuning in Federated Learning without Full Model
- [CCGrid'24] Apodotiko: Enabling Efficient Serverless Federated Learning in Heterogeneous Environments
- [EuroSys'24] Dordis: Efficient Federated Learning with Dropout-Resilient Differential Privacy
- [arxiv'24] Decoupled Vertical Federated Learning for Practical Training on Vertically Partitioned Data
- [SAC'24] Training Heterogeneous Client Models using Knowledge Distillation in Serverless Federated Learning
- [arxiv'23] CAFE: Carbon-Aware Federated Learning in Geographically Distributed Data Centers
- [arxiv'23] Federated Learning of Large Language Models with Parameter-Efficient Prompt Tuning and Adaptive Optimization
- [IMWUT'23] AttFL: A Personalized Federated Learning Framework for Time-series Mobile and Embedded Sensor Data Processing
- [Survey π] [FGCS'23] Model aggregation techniques in federated learning: A comprehensive survey
- [SoCC'23] Auxo: Heterogeneity-Mitigating Federated Learning via Scalable Client Clustering
- [MLSys'23] GlueFL: Reconciling Client Sampling and Model Masking for Bandwidth Efficient Federated Learning
- [WWW'23] To Store or Not? Online Data Selection for Federated Learning with Limited Storage
- [EuroSys'23] REFL: Resource-Efficient Federated Learning
- [VLDB'23] FederatedScope: A Flexible Federated Learning Platform for Heterogeneity
- [RecSys'22] Towards Fair Federated Recommendation Learning: Characterizing the Inter-Dependence of System and Data Heterogeneity
- [TMLR'22] Optimal Client Sampling for Federated Learning
- [ICML'22] FedScale: Benchmarking Model and System Performance of Federated Learning at Scale
- [MobiSys'22] FedBalancer: data and pace control for efficient federated learning on heterogeneous clients
- [MobiCom'22] PyramidFL: A Fine-grained Client Selection Framework for Efficient Federated Learning
- [MLSys'22] PAPAYA: Practical, Private, and Scalable Federated Learning
- [AISTATS'22] Federated Learning with Buffered Asynchronous Aggregation
- [NeurIPS'21] Federated Reconstruction: Partially Local Federated Learning
- [NeurIPS'21] FjORD: Fair and Accurate Federated Learning under heterogeneous targets with Ordered Dropout
- [OSDI'21] Oort: Efficient Federated Learning via Guided Participant Selection
- [MICRO'21] AutoFL: Enabling Heterogeneity-Aware Energy Efficient Federated Learning
- [MLSys'19] Towards Federated Learning at Scale: System Design
- [Survey π] [ACM CSUR'22] Federated Learning for Smart Healthcare: A Survey
- [USENIX Security'25] Phantom: Privacy-Preserving Deep Neural Network Model Obfuscation in Heterogeneous TEE and GPU System
- [ASPLOS'24] LazyDP: Co-Designing Algorithm-Software for Scalable Training of Differentially Private Recommendation Models
- [NeurIPS'24] Nimbus: Secure and Efficient Two-Party Inference for Transformers
- [ACL'24] SecFormer: Fast and Accurate Privacy-Preserving Inference for Transformer Models via SMPC
- [S&P'24] BOLT: Privacy-Preserving, Accurate and Efficient Inference for Transformers
- [DAC'23] Privacy-Preserving DNN Training with Prefetched Meta-Keys on Heterogeneous Neural Network Accelerators
- [ICLR'23] MPCFormer: fast, performant and private Transformer inference with MPC
- [NeurIPS'22] Iron: Private Inference on Transformers
- [ASPLOS'25] Towards End-to-End Optimization of LLM-based Applications with Ayo
- [arxiv'24] APIServe: Efficient API Support for Large-Language Model Inferencing
- [OSDI'24] ChameleonAPI: Automatic and Efficient Customization of Neural Networks for ML Applications
- [ICML'22] Efficient Online ML API Selection for Multi-Label Classification Tasks (FrugalMCT)
- [NeurIPS'20] FrugalML: How to use ML Prediction APIs more accurately and cheaply
- [HotOS'25] How I learned to stop worrying and love learned OS policies
- [VLDB'25] E2ETune: End-to-End Knob Tuning via Fine-tuned Generative Language Model
- [SenSys'25] CheckMate: LLM-Powered Approximate Intermittent Computing
- [ICSE'25] Large Language Models as Configuration Validators
- [NeurIPS'24] IaC-Eval: A code generation benchmark for Infrastructure-as-Code programs
- [arxiv'24] Cloud Atlas: Efficient Fault Localization for Cloud Systems using Language Models and Causal Insight
- [arxiv'24] LLMTune: Accelerate Database Knob Tuning with Large Language Models
- [SIGCOMM'24] NetLLM: Adapting Large Language Models for Networking
- [arxiv'24] LLM-Enhanced Data Management
- [arxiv'24] MPIrigen: MPI Code Generation through Domain-Specific Language Models
- [arxiv'24] Can Large Language Models Write Parallel Code?
- [arxiv'23] LLM-Assisted Code Cleaning For Training Accurate Code Generators
- [arxiv'23] Large Language Models for Compiler Optimization
- [VLDB'23] How Large Language Models Will Disrupt Data Management
- [arxiv'25] FlashDMoE: Fast Distributed MoE in a Single Kernel
- [arxiv'25] TileLang: A Composable Tiled Programming Model for AI Systems
- [PLDI'25] Task-Based Tensor Computations on Modern GPUs
- [arxiv'25] Kitsune: Enabling Dataflow Execution on GPUs
- [ICLR'25] ThunderKittens: Simple, Fast, and Adorable Kernels
- [arxiv'24] ACS: Concurrent Kernel Execution on Irregular, Input-Dependent Computational Graphs
- [RTAS'24] Demystifying NVIDIA GPU Internals to Enable Reliable GPU Management
- slides: link
- [OSDI'23] Welder: Scheduling Deep Learning Memory Access via Tile-graph
- [arxiv'21] Characterizing Concurrency Mechanisms for NVIDIA GPUs under Deep Learning Workloads
- [SIGMETRICS'21] Demystifying the Placement Policies of the NVIDIA GPU Thread Block Scheduler for Concurrent Kernels
- [NeurIPS'20] Nimble: Lightweight and Parallel GPU Task Scheduling for Deep Learning
- [RTSS'17] GPU Scheduling on the NVIDIA TX2: Hidden Details Revealed
- [arxiv'25] The ML.ENERGY Benchmark: Toward Automated Inference Energy Measurement and Optimization
- [arxiv'25] EcoServe: Enabling Cost-effective LLM Serving with Proactive Intra- and Inter-Instance Orchestration
- [NSDI'25] GREEN: Carbon-efficient Resource Scheduling for Machine Learning Clusters
- [HPCA'25] throttLL'eM: Predictive GPU Throttling for Energy Efficient LLM Inference Serving
- [arxiv'25] EcoServe: Designing Carbon-Aware AI Inference Systems
- [arxiv'25] Life-Cycle Emissions of AI Hardware: A Cradle-To-Grave Approach and Generational Trends
- [arxiv'24] GreenLLM: Disaggregating Large Language Model Serving on Heterogeneous GPUs for Lower Carbon Emissions
- [arxiv'24] EaCO: Resource Sharing Dynamics and Its Impact on Energy Efficiency for DNN Training
- [arxiv'24] DynamoLLM: Designing LLM Inference Clusters for Performance and Energy Efficiency
- [SOSP'24] Perseus: Removing Energy Bloat from Large Model Training
- [arxiv'23] CAFE: Carbon-Aware Federated Learning in Geographically Distributed Data Centers
- [ATC'23] EnvPipe: Performance-preserving DNN Training Framework for Saving Energy
- [NSDI'23] Zeus: Understanding and Optimizing GPU Energy Consumption of DNN Training
- [arxiv'25] Patchwork: A Unified Framework for RAG Serving
- [arxiv'25] Accelerating Retrieval-Augmented Language Model Serving with Speculation
- [arxiv'25] RAGO: Systematic Performance Optimization for Retrieval-Augmented Generation Serving
- [arxiv'25] Long-Context Inference with Retrieval-Augmented Speculative Decoding
- [VLDB'25] Chameleon: a heterogeneous and disaggregated accelerator system for retrieval-augmented language models
- [arxiv'24] Towards Understanding Systems Trade-offs in Retrieval-Augmented Generation Model Inference
- [arxiv'24] RAGServe: Fast Quality-Aware RAG Systems with Configuration Adaptation
- [arxiv'24] Dehallucinating Parallel Context Extension for Retrieval-Augmented Generation
- [arxiv'24] Accelerating Retrieval-Augmented Language Model Serving with Speculation
- [arxiv'25] Maya: Optimizing Deep Learning Training Workloads using Emulated Virtual Accelerators
- [NSDI'25] Accelerating Design Space Exploration for LLM Training Systems with Multi-experiment Parallel Simulation
- [ASPLOS'25] Forecasting GPU Performance for Deep Learning Training and Inference
- [MLSys'24] Vidur: A Large-Scale Simulation Framework For LLM Inference
- [arxiv'25] Test-Time Training Done Right
- [arxiv'25] LlamaRL: A Distributed Asynchronous Reinforcement Learning Framework for Efficient Large-scale LLM Training
- [arxiv'25] GSO: Challenging Software Optimization Tasks for Evaluating SWE-Agents
- [arxiv'25] On-Policy RL with Optimal Reward Baseline
- [arxiv'25] MemOS: An Operating System for Memory-Augmented Generation (MAG) in Large Language Models
- [NSDI'25] Optimizing RLHF Training for Large Language Models with Stage Fusion
- [arxiv'25] Thinking Short and Right Over Thinking Long: Serving LLM Reasoning Efficiently and Accurately
- [arxiv'25] Faster Video Diffusion with Trainable Sparse Attention
- [arxiv'25] SSR: Speculative Parallel Scaling Reasoning in Test-time
- [arxiv'25] Hunyuan-TurboS: Advancing Large Language Models through Mamba-Transformer Synergy and Adaptive Chain-of-Thought
- [arxiv'25] Reward Reasoning Model
- [arxiv'25] Think Only When You Need with Large Hybrid-Reasoning Models
- [arxiv'25] Dimple: Discrete Diffusion Multimodal Large Language Model with Parallel Decoding
- [MLSys'25] ReaL: Efficient RLHF Training of Large Language Models with Parameter Reallocation
- [MLSys'25] Optimizing LLM Queries in Relational Data Analytics Workloads
- [arxiv'25] Putting the Value Back in RL: Better Test-Time Scaling by Unifying LLM Reasoners With Verifiers
- [arxiv'25] Understanding the Performance Horizon of the Latest ML Workloads with NonGEMM Workloads
- [arxiv'25] Process Reward Models That Think
- [arxiv'25] StreamRL: Scalable, Heterogeneous, and Elastic RL for LLMs with Disaggregated Stream Generation
- [arxiv'25] Seed-Thinking-v1.5: Advancing Superb Reasoning Models with Reinforcement Learning
- [arxiv'25] Sleep-time Compute: Beyond Inference Scaling at Test-time
- [arxiv'25] DAPO: An Open-Source LLM Reinforcement Learning System at Scale
- [arxiv'25] SpecReason: Fast and Accurate Inference-Time Compute via Speculative Reasoning
- [arxiv'25] Scaling Laws for Native Multimodal Models
- [arxiv'25] OLMoTrace: Tracing Language Model Outputs Back to Trillions of Training Tokens
- [ASPLOS'25] ReCA: Integrated Acceleration for Real-Time and Efficient Cooperative Embodied Autonomous Agents
- [arxiv'25] NotebookOS: A Notebook Operating System for Interactive Training with On-Demand GPUs
- [arxiv'25] Alchemist: Towards the Design of Efficient Online Continual Learning System
- [arxiv'25] Linear Attention for Efficient Bidirectional Sequence Modeling
- [arxiv'25] S*: Test Time Scaling for Code Generation
- [arxiv'25] Optimizing Model Selection for Compound AI Systems
- [arxiv'25] Copilot Arena: A Platform for Code LLM Evaluation in the Wild
- [arxiv'25] The Danger of Overthinking: Examining the Reasoning-Action Dilemma in Agentic Tasks
- [arxiv'25] Efficient-vDiT: Efficient Video Diffusion Transformers with Attention Tile
- [arxiv'25] BARE: Combining Base and Instruction-Tuned Language Models for Better Synthetic Data Generation
- [arxiv'25] Sparse VideoGen: Accelerating Video Diffusion Transformers with Spatial-Temporal Sparsity
- [arxiv'25] Mordal: Automated Pretrained Model Selection for Vision Language Models
- [arxiv'25] Adaptive Semantic Prompt Caching with VectorQ
- [EuroSys'25] HybridFlow: A Flexible and Efficient RLHF Framework
- [arxiv'25] Measuring GPU utilization one level deeper
- [arxiv'24] Optimizing RLHF Training for Large Language Models with Stage Fusion
- [arxiv'24] Smaller, Weaker, Yet Better: Training LLM Reasoners via Compute-Optimal Sampling
- [arxiv'24] Debunking the CUDA Myth Towards GPU-based AI Systems
- [arxiv'24] LlamaFusion: Adapting Pretrained Language Models for Multimodal Generation
- [arxiv'24] XGrammar: Flexible and Efficient Structured Generation Engine for Large Language Models
- [CPAL'24 (PMLR)] Jaxpruner: A Concise Library for Sparsity Research
- [arxiv'24] Scorch: A Library for Sparse Deep Learning
- [arxiv'24] Drowning in Documents: Consequences of Scaling Reranker Inference
- [arxiv'24] Crafting Interpretable Embeddings for Language Neuroscience by Asking LLMs Questions
- [arxiv'24] Computational Bottlenecks of Training Small-scale Large Language Models
- [Survey π] [arxiv'24] A Comprehensive Survey of Small Language Models in the Era of Large Language Models: Techniques, Enhancements, Applications, Collaboration with LLMs, and Trustworthiness
- [arxiv'24] AI Metropolis: Scaling Large Language Model-based Multi-Agent Simulation with Out-of-order Execution
- [ASPLOS'25 (to appear)] PipeLLM: Fast and Confidential Large Language Model Services with Speculative Pipelined Encryption
- [NeurIPS'24] Are More LLM Calls All You Need? Towards Scaling Laws of Compound Inference Systems
- [NeurIPS'24 Workshop] Long Context RAG Performance of Large Language Models
- [arxiv'24] Stochastic Monkeys at Play: Random Augmentations Cheaply Break LLM Safety Alignment
- [arxiv'24] DroidSpeak: Enhancing Cross-LLM Communication
- [arxiv'24] Disaggregating Embedding Recommendation Systems with FlexEMR
- [arxiv'24] JudgeBench: A Benchmark for Evaluating LLM-based Judges
- [arxiv'24] You Only Need One Step: Fast Super-Resolution with Stable Diffusion via Scale Distillation
- [arxiv'24] Computing in the Era of Large Generative Models: From Cloud-Native to AI-Native
- [Survey π] [arxiv'24] A Survey of Resource-efficient LLM and Multimodal Foundation Models
- [ATC'24] Centimani: Enabling Fast AI Accelerator Selection for DNN Training with a Novel Performance Predictor
- [arxiv'23] Efficiently Programming Large Language Models using SGLang
- [MICRO'23] Path Forward Beyond Simulators: Fast and Accurate GPU Execution Time Prediction for DNN Workloads
- [arxiv'23] Direct Preference Optimization: Your Language Model is Secretly a Reward Model
- [arxiv'22] Training language models to follow instructions with human feedback
This repository is motivated by:
- https://github.com/HuaizhengZhang/Awesome-System-for-Machine-Learning
- https://github.com/S-Lab-System-Group/Awesome-DL-Scheduling-Papers
- https://github.com/ganler/ResearchReading
- https://jeongseob.github.io/readings_mlsys.html
- https://github.com/chwan1016/awesome-gnn-systems
- https://github.com/ConnollyLeon/awesome-Auto-Parallelism