
Inference System

Systems for machine learning inference.

  • Reddi, Vijay Janapa, et al. "MLPerf Inference Benchmark." arXiv preprint arXiv:1911.02549 (2019). [Paper] [GitHub]
  • Bianco, Simone, et al. "Benchmark analysis of representative deep neural network architectures." IEEE Access 6 (2018): 64270-64277. [Paper]

Model Zoo (Experiment Version Control)

  • TRAINS - Auto-Magical Experiment Manager & Version Control for AI [GitHub]
  • ModelDB: A system to manage ML models [GitHub] [MIT short paper]
  • iterative/dvc: Data & models versioning for ML projects, make them shareable and reproducible [GitHub]

Model Serving

  • Willump: A Statistically-Aware End-to-end Optimizer for Machine Learning Inference [arXiv] [GitHub]
    • Peter Kraft, Daniel Kang, Deepak Narayanan, Shoumik Palkar, Peter Bailis, Matei Zaharia.
    • arXiv Preprint. 2019.
  • Parity Models: Erasure-Coded Resilience for Prediction Serving Systems (SOSP 2019) [Paper] [GitHub]
  • INFaaS: A Model-less Inference Serving System [Paper] [GitHub]
    • Romero, F., Li, Q., Yadwadkar, N.J. and Kozyrakis, C., 2019.
    • arXiv preprint arXiv:1905.13348.
  • Nexus: a scalable and efficient serving system for DNN applications on a GPU cluster (SOSP 2019) [Paper] [GitHub]
  • Deep Learning Inference Service at Microsoft [Paper]
    • Soifer, J., et al. (OptML 2019)
  • PRETZEL: Opening the Black Box of Machine Learning Prediction Serving Systems [Paper]
    • Lee, Y., Scolari, A., Chun, B.G., Santambrogio, M.D., Weimer, M. and Interlandi, M., 2018. (OSDI 2018)
  • Brusta: PyTorch model serving project [GitHub]
  • Model Server for Apache MXNet: a tool for serving neural network models for inference [GitHub]
  • TFX: A TensorFlow-Based Production-Scale Machine Learning Platform [Paper] [Website] [GitHub]
    • Baylor, Denis, et al. (KDD 2017)
  • TensorFlow-Serving: Flexible, High-Performance ML Serving [Paper] [GitHub]
    • Olston, Christopher, et al.
  • IntelAI/OpenVINO-model-server: Inference model server implementation with gRPC interface, compatible with TensorFlow serving API and OpenVINO™ as the execution backend. [GitHub]
  • Clipper: A Low-Latency Online Prediction Serving System [Paper] [GitHub]
    • Crankshaw, Daniel, et al. (NSDI 2017)
    • Summary: adaptive batching to meet latency objectives (see the batching sketch after this list)
  • InferLine: ML Inference Pipeline Composition Framework [Paper] [GitHub]
    • Crankshaw, Daniel, et al. (Preprint)
    • Summary: updated version of Clipper
  • TrIMS: Transparent and Isolated Model Sharing for Low Latency Deep Learning Inference in Function as a Service Environments [Paper]
    • Dakkak, Abdul, et al (Preprint)
    • Summary: tackles the model cold-start problem in FaaS environments
  • Rafiki: Machine Learning as an Analytics Service System [Paper] [GitHub]
    • Wang, Wei, Jinyang Gao, Meihui Zhang, Sheng Wang, Gang Chen, Teck Khim Ng, Beng Chin Ooi, Jie Shao, and Moaz Reyad.
    • Summary: covers both training and inference. Automatic hyper-parameter search for training, model ensembles for inference (see the ensemble sketch after this list), and deep reinforcement learning (DRL) to balance the accuracy/latency trade-off.
  • GraphPipe: Machine Learning Model Deployment Made Simple [GitHub]
  • DeepCPU: Serving RNN-based Deep Learning Models 10x Faster [Paper]
    • Zhang, M., Rajbhandari, S., Wang, W. and He, Y., 2018. (ATC 2018)
  • Orkhon: ML Inference Framework and Server Runtime [GitHub]
  • NVIDIA/tensorrt-inference-server: The TensorRT Inference Server provides a cloud inferencing solution optimized for NVIDIA GPUs. [GitHub]
  • Apache PredictionIO®: an open source Machine Learning Server built on top of a state-of-the-art open source stack for developers and data scientists to create predictive engines for any machine learning task [Website]
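
Clipper's summary above mentions adaptive batching. The sketch below illustrates only the general idea, not Clipper's actual implementation: requests are queued and flushed to the model either when a batch fills up or when a short queuing deadline expires. The names (`MAX_BATCH`, `MAX_WAIT_S`, `run_model`) and the `(input, reply_callback)` queue protocol are invented for this example.

```python
import time
from queue import Queue, Empty

# Illustrative sketch only: MAX_BATCH, MAX_WAIT_S, run_model and the
# (input, reply_callback) queue protocol are invented for this example.
MAX_BATCH = 32      # largest batch the model backend accepts
MAX_WAIT_S = 0.005  # queuing deadline per batch (5 ms)

request_queue = Queue()  # callers put (input, reply_callback) tuples here

def run_model(inputs):
    # Placeholder for the real model call (e.g. a framework predict() on a batch).
    return [f"prediction-for-{x}" for x in inputs]

def batching_loop():
    while True:
        batch, deadline = [], None
        while len(batch) < MAX_BATCH:
            timeout = None if deadline is None else max(0.0, deadline - time.monotonic())
            try:
                batch.append(request_queue.get(timeout=timeout))
            except Empty:
                break  # deadline expired: flush whatever has accumulated
            if deadline is None:
                deadline = time.monotonic() + MAX_WAIT_S  # clock starts at the first request
        if batch:
            outputs = run_model([inp for inp, _ in batch])
            for (_, reply), out in zip(batch, outputs):
                reply(out)  # hand each result back to its caller
```

In the Clipper paper the batch-size cap is itself tuned adaptively (an AIMD scheme) so that batches stay within the latency objective; the fixed `MAX_BATCH` above is a simplification.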
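
Rafiki's summary mentions ensembling models at inference time. The sketch below shows only the generic idea of averaging several models' class probabilities; `ensemble_predict` and `ConstantModel` are invented names, not Rafiki's API. Rafiki's additional contribution, per its summary, is using DRL to decide how to serve such ensembles without blowing the latency budget.

```python
import numpy as np

# Generic illustration of inference-time ensembling; ensemble_predict and
# ConstantModel are invented names, not part of Rafiki.
def ensemble_predict(models, x):
    """Average each model's class-probability vector and pick the top class."""
    probs = np.stack([m.predict_proba(x) for m in models])  # shape: (n_models, n_classes)
    return int(probs.mean(axis=0).argmax())

class ConstantModel:
    # Stand-in model that always returns the same probabilities, to keep the sketch runnable.
    def __init__(self, probs):
        self._probs = np.asarray(probs, dtype=float)

    def predict_proba(self, x):
        return self._probs

models = [ConstantModel([0.7, 0.3]), ConstantModel([0.4, 0.6]), ConstantModel([0.8, 0.2])]
print(ensemble_predict(models, x=None))  # -> 0: the averaged probabilities favor class 0
```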

Inference Optimization

  • TensorRT is a C++ library that facilitates high performance inference on NVIDIA GPUs and deep learning accelerators. [GitHub]
  • Dynamic Space-Time Scheduling for GPU Inference [Paper] [GitHub]
    • Jain, Paras, et al. (NeurIPS 2018 Systems for ML Workshop)
    • Summary: optimization for GPU multi-tenancy
  • Dynamic Scheduling For Dynamic Control Flow in Deep Learning Systems [Paper]
    • Wei, Jinliang, Garth Gibson, Vijay Vasudevan, and Eric Xing. (ongoing work)
  • Accelerating Deep Learning Workloads through Efficient Multi-Model Execution. [Paper]
    • D. Narayanan, K. Santhanam, A. Phanishayee and M. Zaharia. (NeurIPS Systems for ML Workshop 2018)
    • Summary: their system, HiveMind, takes as input models grouped into batches that are amenable to co-optimization and co-execution; it consists of a compiler and a runtime.
  • DeepCPU: Serving RNN-based Deep Learning Models 10x Faster [Paper]
    • Minjia Zhang, Samyam Rajbhandari, Wenhan Wang, and Yuxiong He, Microsoft AI and Research (ATC 2018)

Machine Learning Compiler

  • TVM: An Automated End-to-End Optimizing Compiler for Deep Learning [Paper] [YouTube] [Project Website]
    • Chen, Tianqi, et al. (OSDI 2018)
    • Summary: impressive automated optimization, combining a learned cost model (to rank candidate schedules) with a schedule explorer based on parallel simulated annealing (see the annealing sketch after this list)
  • Facebook TC: Tensor Comprehensions (TC) is a fully-functional C++ library to automatically synthesize high-performance machine learning kernels using Halide, ISL and NVRTC or LLVM. [GitHub]
  • TensorFlow/MLIR: "Multi-Level Intermediate Representation" compiler infrastructure [GitHub] [Video]
  • PyTorch/glow: Compiler for Neural Network hardware accelerators [GitHub]
  • TASO: Optimizing Deep Learning Computation with Automatic Generation of Graph Substitutions [Paper] [GitHub]
    • Jia, Zhihao, Oded Padon, James Thomas, Todd Warszawski, Matei Zaharia, and Alex Aiken. (SOSP 2019)
    • Summary: experiments evaluated against TVM and XLA (see the graph-substitution sketch after this list)
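
To make the TVM summary above more concrete, here is a toy sketch of a schedule explorer driven by a cost model, using simulated annealing over two made-up tuning knobs. The knobs, the cost function, and the annealing parameters are all invented for illustration; TVM's actual explorer uses a learned cost model over a real schedule space and, as the summary notes, runs the annealing search in parallel.

```python
import math
import random

# Toy schedule space: two tuning knobs. Both the knobs and the cost model are
# invented for this sketch; they stand in for TVM's real schedule space and
# learned cost model.
TILE_SIZES = [1, 2, 4, 8, 16, 32]
UNROLL_FACTORS = [1, 2, 4, 8]

def predicted_cost(schedule):
    # Stand-in cost model: pretends tile=16, unroll=4 is the fastest configuration.
    tile, unroll = schedule
    return abs(tile - 16) + abs(unroll - 4)

def neighbor(schedule):
    # Propose a nearby schedule by mutating one knob at random.
    tile, unroll = schedule
    if random.random() < 0.5:
        return (random.choice(TILE_SIZES), unroll)
    return (tile, random.choice(UNROLL_FACTORS))

def anneal(steps=1000, temp=2.0, cooling=0.995):
    current = (random.choice(TILE_SIZES), random.choice(UNROLL_FACTORS))
    best = current
    for _ in range(steps):
        candidate = neighbor(current)
        delta = predicted_cost(candidate) - predicted_cost(current)
        # Always accept improvements; accept regressions with a probability
        # that shrinks as the temperature cools.
        if delta < 0 or random.random() < math.exp(-delta / temp):
            current = candidate
            if predicted_cost(current) < predicted_cost(best):
                best = current
        temp *= cooling
    return best

print(anneal())  # typically (16, 4), the schedule the toy cost model prefers
```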
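
The graph substitutions that TASO works with can be pictured as rewrite rules over an operator graph. The hand-written rule below (eliminating back-to-back transposes over a single-input operator chain) is only a minimal illustration with an invented `Node` type; TASO's point is that it generates such substitutions automatically and verifies them with an automated theorem prover rather than relying on hand-written rules.

```python
from dataclasses import dataclass
from typing import Optional

# Invented, minimal operator-chain representation; real computation graphs are
# DAGs with multi-input operators.
@dataclass
class Node:
    op: str
    input: Optional["Node"] = None

def eliminate_double_transpose(node: Optional[Node]) -> Optional[Node]:
    """Apply the rewrite transpose(transpose(x)) -> x anywhere in the chain."""
    if node is None:
        return None
    if node.op == "transpose" and node.input is not None and node.input.op == "transpose":
        return eliminate_double_transpose(node.input.input)  # drop both transposes
    node.input = eliminate_double_transpose(node.input)
    return node

# relu(transpose(transpose(matmul(...)))) simplifies to relu(matmul(...)).
graph = Node("relu", Node("transpose", Node("transpose", Node("matmul"))))
simplified = eliminate_double_transpose(graph)
print(simplified.op, "->", simplified.input.op)  # relu -> matmul
```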