
Inference System

Systems for machine learning inference.

  • Reddi, Vijay Janapa, et al. "MLPerf Inference Benchmark." arXiv preprint arXiv:1911.02549 (2019). [Paper] [GitHub]
  • Bianco, Simone, et al. "Benchmark analysis of representative deep neural network architectures." IEEE Access 6 (2018): 64270-64277. [Paper]

Model Zoo (Experiment Version Control)

  • TRAINS - Auto-Magical Experiment Manager & Version Control for AI [GitHub]
  • ModelDB: A system to manage ML models [GitHub] [MIT short paper]
  • iterative/dvc: Data & models versioning for ML projects, make them shareable and reproducible [GitHub]

Model Serving

  • Willump: A Statistically-Aware End-to-end Optimizer for Machine Learning Inference [arXiv] [GitHub]
    • Peter Kraft, Daniel Kang, Deepak Narayanan, Shoumik Palkar, Peter Bailis, Matei Zaharia.
    • arXiv Preprint. 2019.
  • Parity Models: Erasure-Coded Resilience for Prediction Serving Systems (SOSP 2019) [Paper] [GitHub]
  • INFaaS: A Model-less Inference Serving System [Paper] [GitHub]
    • Romero, F., Li, Q., Yadwadkar, N.J. and Kozyrakis, C., 2019.
    • arXiv preprint arXiv:1905.13348.
  • Nexus: a scalable and efficient serving system for DNN applications on a GPU cluster (SOSP 2019) [Paper] [GitHub]
  • Deep Learning Inference Service at Microsoft [Paper]
    • Soifer, J., et al. (OptML 2019)
  • PRETZEL: Opening the Black Box of Machine Learning Prediction Serving Systems [Paper]
    • Lee, Y., Scolari, A., Chun, B.G., Santambrogio, M.D., Weimer, M. and Interlandi, M., 2018. (OSDI 2018)
  • Brusta: PyTorch model serving project [GitHub]
  • Model Server for Apache MXNet: a tool for serving neural network models for inference [GitHub]
  • TFX: A TensorFlow-Based Production-Scale Machine Learning Platform [Paper] [Website] [GitHub]
    • Baylor, Denis, et al. (KDD 2017)
  • TensorFlow-Serving: Flexible, High-Performance ML Serving [Paper] [GitHub]
    • Olston, Christopher, et al.
  • IntelAI/OpenVINO-model-server: Inference model server implementation with gRPC interface, compatible with TensorFlow serving API and OpenVINO™ as the execution backend. [GitHub]
  • Clipper: A Low-Latency Online Prediction Serving System [Paper] [GitHub]
    • Crankshaw, Daniel, et al. (NSDI 2017)
    • Summary: adaptive batching to meet latency objectives (see the batching sketch after this list)
  • InferLine: ML Inference Pipeline Composition Framework [Paper] [GitHub]
    • Crankshaw, Daniel, et al. (Preprint)
    • Summary: updated version of Clipper
  • TrIMS: Transparent and Isolated Model Sharing for Low Latency Deep Learning Inference in Function as a Service Environments [Paper]
    • Dakkak, Abdul, et al (Preprint)
    • Summary: tackles the model cold-start problem in FaaS environments
  • Rafiki: Machine Learning as an Analytics Service System [Paper] [GitHub]
    • Wang, Wei, Jinyang Gao, Meihui Zhang, Sheng Wang, Gang Chen, Teck Khim Ng, Beng Chin Ooi, Jie Shao, and Moaz Reyad.
    • Summary: covers both training and inference. Automatic hyper-parameter search for training, model ensembles for inference (see the ensemble sketch after this list), and deep reinforcement learning (DRL) to balance the accuracy/latency trade-off.
  • GraphPipe: Machine Learning Model Deployment Made Simple [GitHub]
  • DeepCPU: Serving RNN-based Deep Learning Models 10x Faster [Paper]
    • Zhang, M., Rajbhandari, S., Wang, W. and He, Y., 2018. (ATC 2018)
  • Orkhon: ML Inference Framework and Server Runtime [GitHub]
  • NVIDIA/tensorrt-inference-server: The TensorRT Inference Server provides a cloud inferencing solution optimized for NVIDIA GPUs. [GitHub]
  • Apache PredictionIO®: an open source Machine Learning Server built on top of a state-of-the-art open source stack for developers and data scientists to create predictive engines for any machine learning task [Website]
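
Clipper's summary above mentions adaptive batching. The sketch below illustrates only the general idea, not Clipper's actual implementation: requests are queued and flushed to the model either when a batch fills up or when a short queuing deadline expires. The names (`MAX_BATCH`, `MAX_WAIT_S`, `run_model`) and the `(input, reply_callback)` queue protocol are invented for this example.

```python
import time
from queue import Queue, Empty

# Illustrative sketch only: MAX_BATCH, MAX_WAIT_S, run_model and the
# (input, reply_callback) queue protocol are invented for this example.
MAX_BATCH = 32      # largest batch the model backend accepts
MAX_WAIT_S = 0.005  # queuing deadline per batch (5 ms)

request_queue = Queue()  # callers put (input, reply_callback) tuples here

def run_model(inputs):
    # Placeholder for the real model call (e.g. a framework predict() on a batch).
    return [f"prediction-for-{x}" for x in inputs]

def batching_loop():
    while True:
        batch, deadline = [], None
        while len(batch) < MAX_BATCH:
            timeout = None if deadline is None else max(0.0, deadline - time.monotonic())
            try:
                batch.append(request_queue.get(timeout=timeout))
            except Empty:
                break  # deadline expired: flush whatever has accumulated
            if deadline is None:
                deadline = time.monotonic() + MAX_WAIT_S  # clock starts at the first request
        if batch:
            outputs = run_model([inp for inp, _ in batch])
            for (_, reply), out in zip(batch, outputs):
                reply(out)  # hand each result back to its caller
```

In the Clipper paper the batch-size cap is itself tuned adaptively (an AIMD scheme) so that batches stay within the latency objective; the fixed `MAX_BATCH` above is a simplification.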
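
Rafiki's summary mentions ensembling models at inference time. The sketch below shows only the generic idea of averaging several models' class probabilities; `ensemble_predict` and `ConstantModel` are invented names, not Rafiki's API. Rafiki's additional contribution, per its summary, is using DRL to decide how to serve such ensembles without blowing the latency budget.

```python
import numpy as np

# Generic illustration of inference-time ensembling; ensemble_predict and
# ConstantModel are invented names, not part of Rafiki.
def ensemble_predict(models, x):
    """Average each model's class-probability vector and pick the top class."""
    probs = np.stack([m.predict_proba(x) for m in models])  # shape: (n_models, n_classes)
    return int(probs.mean(axis=0).argmax())

class ConstantModel:
    # Stand-in model that always returns the same probabilities, to keep the sketch runnable.
    def __init__(self, probs):
        self._probs = np.asarray(probs, dtype=float)

    def predict_proba(self, x):
        return self._probs

models = [ConstantModel([0.7, 0.3]), ConstantModel([0.4, 0.6]), ConstantModel([0.8, 0.2])]
print(ensemble_predict(models, x=None))  # -> 0: the averaged probabilities favor class 0
```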

Inference Optimization

  • TensorRT is a C++ library that facilitates high performance inference on NVIDIA GPUs and deep learning accelerators. [GitHub]
  • Dynamic Space-Time Scheduling for GPU Inference [Paper] [GitHub]
    • Jain, Paras, et al. (NeurIPS 2018 Systems for ML Workshop)
    • Summary: optimization for GPU multi-tenancy
  • Dynamic Scheduling For Dynamic Control Flow in Deep Learning Systems [Paper]
    • Wei, Jinliang, Garth Gibson, Vijay Vasudevan, and Eric Xing. (ongoing work)
  • Accelerating Deep Learning Workloads through Efficient Multi-Model Execution. [Paper]
    • D. Narayanan, K. Santhanam, A. Phanishayee and M. Zaharia. (NeurIPS Systems for ML Workshop 2018)
    • Summary: their system, HiveMind, takes as input models grouped into batches that are amenable to co-optimization and co-execution; it consists of a compiler and a runtime.
  • DeepCPU: Serving RNN-based Deep Learning Models 10x Faster [Paper]
    • Minjia Zhang, Samyam Rajbhandari, Wenhan Wang, and Yuxiong He, Microsoft AI and Research (ATC 2018)

Machine Learning Compiler

  • TVM: An Automated End-to-End Optimizing Compiler for Deep Learning [Paper] [YouTube] [Project Website]
    • Chen, Tianqi, et al. (OSDI 2018)
    • Summary: impressive automated optimization, combining a learned cost model (to rank candidate schedules) with a schedule explorer based on parallel simulated annealing (see the annealing sketch after this list)
  • Facebook TC: Tensor Comprehensions (TC) is a fully-functional C++ library to automatically synthesize high-performance machine learning kernels using Halide, ISL and NVRTC or LLVM. [GitHub]
  • TensorFlow/MLIR: "Multi-Level Intermediate Representation" compiler infrastructure [GitHub] [Video]
  • PyTorch/glow: Compiler for Neural Network hardware accelerators [GitHub]
  • TASO: Optimizing Deep Learning Computation with Automatic Generation of Graph Substitutions [Paper] [GitHub]
    • Jia, Zhihao, Oded Padon, James Thomas, Todd Warszawski, Matei Zaharia, and Alex Aiken. (SOSP 2019)
    • Summary: experiments evaluated against TVM and XLA (see the graph-substitution sketch after this list)
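
To make the TVM summary above more concrete, here is a toy sketch of a schedule explorer driven by a cost model, using simulated annealing over two made-up tuning knobs. The knobs, the cost function, and the annealing parameters are all invented for illustration; TVM's actual explorer uses a learned cost model over a real schedule space and, as the summary notes, runs the annealing search in parallel.

```python
import math
import random

# Toy schedule space: two tuning knobs. Both the knobs and the cost model are
# invented for this sketch; they stand in for TVM's real schedule space and
# learned cost model.
TILE_SIZES = [1, 2, 4, 8, 16, 32]
UNROLL_FACTORS = [1, 2, 4, 8]

def predicted_cost(schedule):
    # Stand-in cost model: pretends tile=16, unroll=4 is the fastest configuration.
    tile, unroll = schedule
    return abs(tile - 16) + abs(unroll - 4)

def neighbor(schedule):
    # Propose a nearby schedule by mutating one knob at random.
    tile, unroll = schedule
    if random.random() < 0.5:
        return (random.choice(TILE_SIZES), unroll)
    return (tile, random.choice(UNROLL_FACTORS))

def anneal(steps=1000, temp=2.0, cooling=0.995):
    current = (random.choice(TILE_SIZES), random.choice(UNROLL_FACTORS))
    best = current
    for _ in range(steps):
        candidate = neighbor(current)
        delta = predicted_cost(candidate) - predicted_cost(current)
        # Always accept improvements; accept regressions with a probability
        # that shrinks as the temperature cools.
        if delta < 0 or random.random() < math.exp(-delta / temp):
            current = candidate
            if predicted_cost(current) < predicted_cost(best):
                best = current
        temp *= cooling
    return best

print(anneal())  # typically (16, 4), the schedule the toy cost model prefers
```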
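
The graph substitutions that TASO works with can be pictured as rewrite rules over an operator graph. The hand-written rule below (eliminating back-to-back transposes over a single-input operator chain) is only a minimal illustration with an invented `Node` type; TASO's point is that it generates such substitutions automatically and verifies them with an automated theorem prover rather than relying on hand-written rules.

```python
from dataclasses import dataclass
from typing import Optional

# Invented, minimal operator-chain representation; real computation graphs are
# DAGs with multi-input operators.
@dataclass
class Node:
    op: str
    input: Optional["Node"] = None

def eliminate_double_transpose(node: Optional[Node]) -> Optional[Node]:
    """Apply the rewrite transpose(transpose(x)) -> x anywhere in the chain."""
    if node is None:
        return None
    if node.op == "transpose" and node.input is not None and node.input.op == "transpose":
        return eliminate_double_transpose(node.input.input)  # drop both transposes
    node.input = eliminate_double_transpose(node.input)
    return node

# relu(transpose(transpose(matmul(...)))) simplifies to relu(matmul(...)).
graph = Node("relu", Node("transpose", Node("transpose", Node("matmul"))))
simplified = eliminate_double_transpose(graph)
print(simplified.op, "->", simplified.input.op)  # relu -> matmul
```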