[ICML 2024] LLMCompiler: An LLM Compiler for Parallel Function Calling
KVQuant: Towards 10 Million Context Length LLM Inference with KV Cache Quantization
On-device LLM Inference Powered by X-Bit Quantization
Official implementation of "AsyncDiff: Parallelizing Diffusion Models by Asynchronous Denoising"
[CVPR 2024] DeepCache: Accelerating Diffusion Models for Free
"LightGaussian: Unbounded 3D Gaussian Compression with 15x Reduction and 200+ FPS", Zhiwen Fan, Kevin Wang, Kairun Wen, Zehao Zhu, Dejia Xu, Zhangyang Wang
Fast and Robust Early-Exiting Framework for Autoregressive Language Models with Synchronized Parallel Decoding (EMNLP 2023 Long)
Learning-to-Cache: Accelerating Diffusion Transformer via Layer Caching
Efficient AI Backbones including GhostNet, TNT and MLP, developed by Huawei Noah's Ark Lab.
[ICLR 2022] Code for Graph-less Neural Networks: Teaching Old MLPs New Tricks via Distillation (GLNN)
[ICML 2024] SqueezeLLM: Dense-and-Sparse Quantization
[NeurIPS'23] Speculative Decoding with Big Little Decoder
[ICLR 2024] AutoVP: An Automated Visual Prompting Framework and Benchmark
Explorations into some recent techniques surrounding speculative decoding
[ICML 2023] Linkless Link Prediction via Relational Distillation
Official PyTorch training code of Accelerating Deep Neural Networks via Semi-Structured Activation Sparsity (ICCV2023-RCV)
EfficientFormerV2 [ICCV 2023] & EfficientFormer [NeurIPS 2022]
[BMVC 2022] Wide Feature Projection with Fast and Memory-Economic Attention for Efficient Image Super-Resolution
Code for Learning to Zoom and Unzoom (CVPR 2023)
Compute-efficient reinforcement learning with binary neural networks and evolution strategies.