MLModelScope
The current landscape of Machine Learning (ML) and Deep Learning (DL) is rife with non-uniform models, frameworks, and system stacks but lacks standard tools to evaluate and profile models or systems. Without such tools, evaluating and comparing the benefits of proposed AI innovations (hardware or software) on end-to-end AI pipelines is both arduous and error-prone, which stifles adoption of those innovations.
MLModelScope is a hardware/software agnostic, extensible and customizable platform for evaluating and profiling ML models across datasets/frameworks/hardware, and within AI application pipelines. MLModelScope lowers the cost and effort of model evaluation and profiling, making it easier for others to reproduce, evaluate, and analyze accuracy or performance claims of models and systems.
It is designed to aid in:
- reproducing and comparing with published models, and designing models with performance and deployment in mind,
- understanding the model performance (within real-world AI workflows) and its interaction with all levels of the hardware/software stack,
- discovering models, frameworks and hardware that are applicable to users' datasets.
To achieve this, MLModelScope:
- Provides a consistent evaluation, aggregation, and reporting system by defining
- techniques to specify and provision workflows with HW/SW stacks
- abstractions for evaluation and profiling using different frameworks
- data consumption for evaluation outputs
- Enables profiling of experiments throughout the entire pipeline and at different abstraction levels (application, model, framework, layer, library and hardware, as shown on the right)
- Is framework agnostic - with current support for PyTorch, TensorFlow, ONNXRuntime, MXNet
- Is extensible and customizable - allowing users to extend MLModelScope by adding models, frameworks, or library and system profilers.
- Can run experiments on separate machines and behind a firewall (without exposing model weights or machine specifications)
- Allows parallel evaluation (multiple instantiations of the same experiment set-up across systems)
- Specifies model and framework resources as asset files which can be added easily, even at runtime
MLModelScope can be used as an application with a command line, API or web interface, or can be compiled into a standalone library. We also provide an online hub of continuously updated assets, evaluation results, and access to hardware resources — allowing users to discover and evaluate models without installing or configuring systems.
python>=3.7
opentelemetry-api
opentelemetry-sdk
opentelemetry-exporter-otlp-proto-grpc
grpcio
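The Python packages listed above can be checked from within the target environment. The following is a minimal sketch, not part of MLModelScope: `missing_packages` is a hypothetical helper, and it relies on `importlib.metadata`, which is in the standard library only from Python 3.8 onward.

```python
from importlib import metadata  # standard library from Python 3.8 onward

# Distribution names copied from the requirements list above.
REQUIRED = [
    "opentelemetry-api",
    "opentelemetry-sdk",
    "opentelemetry-exporter-otlp-proto-grpc",
    "grpcio",
]


def missing_packages(names):
    """Return the distribution names that are not installed in this environment."""
    missing = []
    for name in names:
        try:
            metadata.version(name)
        except metadata.PackageNotFoundError:
            missing.append(name)
    return missing


if __name__ == "__main__":
    absent = missing_packages(REQUIRED)
    print("missing:", absent if absent else "none")
```

An empty result only says the distributions are installed; it does not check that their versions are mutually compatible.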
We first discuss a bare-minimum pytorch-agent installation without the tracing and profiling capabilities. For this to work, the following libraries must be preinstalled on your system.
- The CUDA library (required)
- The CUPTI library (required)
- The cuDNN library (required for mxnet-agent)
- The PyTorch Python library (required for pytorch-agent)
- The TensorFlow Python library (required for tensorflow-agent)
- The ONNXRuntime and ONNX Python library (required for onnxruntime-agent)
- The MXNet Python library (required for mxnet-agent)
For installation, refer to NVIDIA's CUDA library installation instructions. Find the location of your local CUDA installation, which is typically /usr/local/cuda/, and set up the path to the libcublas.so library.
For installation, refer to NVIDIA's CUPTI library installation instructions. Find the location of your local CUPTI installation, which is typically /usr/local/cuda/extras/CUPTI, and set up the path to the libcupti.so library.
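To find where the two libraries actually live on a given machine, a short search under the typical roots named above may help. This is a sketch under stated assumptions: `find_shared_library` is a hypothetical helper, and the roots are the typical locations from the text, which may differ on your system.

```python
import glob
import os


def find_shared_library(root, stem):
    """Return sorted paths under `root` whose file name starts with `<stem>.so`."""
    pattern = os.path.join(root, "**", stem + ".so*")
    return sorted(glob.glob(pattern, recursive=True))


if __name__ == "__main__":
    # Typical locations named in the text; adjust them to your installation.
    for root, stem in [("/usr/local/cuda", "libcublas"),
                       ("/usr/local/cuda/extras/CUPTI", "libcupti")]:
        hits = find_shared_library(root, stem)
        print(stem, "->", hits[0] if hits else "not found under " + root)
```

The directory containing a hit is the one to add to the library search path, for example `LD_LIBRARY_PATH` on Linux.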
Also, build the prerequisite dynamic library.
On Linux
cd pycupti/csrc
export PATH="/usr/local/cuda/bin:$PATH"
nvcc -O3 --shared -Xcompiler -fPIC utils.cpp -o libutils.so -lcuda -lcudart -lcupti -lnvperf_host -lnvperf_target -I /usr/local/cuda/extras/CUPTI/include -L /usr/local/cuda/extras/CUPTI/lib64
On Windows
cd pycupti/csrc
nvcc -O3 --shared utils.cpp -o utils.dll -I"%CUDA_PATH%/include" -I"%CUDA_PATH%/extras/CUPTI/include" -L"%CUDA_PATH%"/extras/CUPTI/lib64 -L"%CUDA_PATH%"/lib/x64 -lcuda -lcudart -lcupti -lnvperf_host -lnvperf_target -Xcompiler "/EHsc /GL /Gy /O2 /Zc:inline /fp:precise /D "_WINDLL" /Zc:forScope /Oi /MD" && del utils.lib utils.exp
After running the above commands, check that libutils.so (on Linux) or utils.dll (on Windows) is in the pycupti/csrc directory.
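The check can also be scripted. The sketch below assumes it is run from the repository root; `built_library_path` and `try_load` are hypothetical helpers, not part of pycupti. Loading the file with `ctypes` goes one step further than checking that it exists, because it also fails when the CUDA and CUPTI libraries it depends on cannot be resolved.

```python
import ctypes
import os
import sys


def built_library_path(csrc_dir="pycupti/csrc", platform=sys.platform):
    """Return where the build step is expected to leave its output."""
    name = "utils.dll" if platform.startswith("win") else "libutils.so"
    return os.path.join(csrc_dir, name)


def try_load(path):
    """Return True if the shared library at `path` can be loaded, else False."""
    try:
        ctypes.CDLL(path)
        return True
    except OSError:
        return False


if __name__ == "__main__":
    path = built_library_path()
    print(path, "exists:", os.path.isfile(path), "loads:", try_load(path))
```

If the file exists but does not load, the usual suspect is a missing library search path for CUDA or CUPTI.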
The PyTorch Python library is required for our pytorch-agent.
To install it, refer to the PyTorch installation instructions.
PyTorch v1.8.1 with CUDA v11.1 Installation in Anaconda Environment
conda create -n pytorch181cu111 python=3.8
conda activate pytorch181cu111
pip install torch==1.8.1+cu111 torchvision==0.9.1+cu111 torchaudio==0.8.1 -f https://download.pytorch.org/whl/torch_stable.html
pip install opentelemetry-api opentelemetry-sdk opentelemetry-exporter-otlp-proto-grpc grpcio
pip install opencv-contrib-python
pip install aenum
pip install requests
pip install psycopg
pip install "psycopg[binary]"
pip install Pika
The TensorFlow Python library is required for our tensorflow-agent.
TensorFlow v2 Installation in Anaconda Environment
conda create -n tf2gpu python=3.8
conda activate tf2gpu
On Windows
conda install -c conda-forge cudatoolkit=11.2 cudnn=8.1.0
# Anything above 2.10 is not supported on the GPU on Windows Native
python -m pip install "tensorflow<2.11"
# Verify install:
python -c "import tensorflow as tf; print(tf.config.list_physical_devices('GPU'))"
pip install opencv-contrib-python
pip install opentelemetry-api opentelemetry-sdk opentelemetry-exporter-otlp-proto-grpc grpcio
pip install aenum
pip install requests
pip install psycopg
pip install "psycopg[binary]"
pip install Pika
TensorFlow v1.14.0 with CUDA v10.0 Installation in Anaconda Environment
conda create -n tf114gpu
conda activate tf114gpu
On Windows
conda install -c anaconda tensorflow-gpu=1.14.0
# Verify install:
python -c "import tensorflow as tf; print(tf.test.is_gpu_available())"
pip install opentelemetry-api
pip install opentelemetry-sdk
pip install opentelemetry-exporter-otlp-proto-grpc
pip install grpcio==1.27.2
pip install google-pasta # for tensorflow v1.14.0
pip install opencv-python # conda install -c conda-forge opencv
pip install Pillow
pip install aenum
pip install requests
pip install psycopg
pip install "psycopg[binary]"
pip install Pika
The ONNXRuntime and ONNX Python libraries are required for our onnxruntime-agent.
ONNXRuntime v1.7.0 Installation in Anaconda Environment
conda create -n ort170 python=3.8
conda activate ort170
On Windows and Linux
pip install onnxruntime-gpu==1.7.0
conda install -c conda-forge cudatoolkit=11.0 cudnn
# Verify install:
# Even if the result is ['TensorrtExecutionProvider', 'CUDAExecutionProvider', 'CPUExecutionProvider'], the GPU device is not guaranteed to be usable.
python -c "import onnxruntime as ort;print(ort.get_available_providers())"
pip install opentelemetry-api opentelemetry-sdk opentelemetry-exporter-otlp-proto-grpc grpcio
pip install aenum
conda install -c pytorch torchvision
conda install -c conda-forge onnx
conda install -c anaconda scipy
pip install opencv-python # conda install -c conda-forge opencv
pip install requests
pip install psycopg
pip install "psycopg[binary]"
pip install Pika
The MXNet Python library is required for our mxnet-agent.
MXNet v1.8.0 with CUDA v10.2 Installation in Anaconda Environment
v1.8.0 cannot be installed on Windows.
cuDNN local installation is required for MXNet.
Before issuing the following commands, you must replace X.Y and v8.x.x.x with your specific CUDA and cuDNN versions and package date.
- Navigate to your directory containing the cuDNN tar file.
- Unzip the cuDNN package.
tar -xvf cudnn-linux-x86_64-8.x.x.x_cudaX.Y-archive.tar.xz
- Copy the following files into the CUDA toolkit directory.
sudo cp cudnn-*-archive/include/cudnn*.h /usr/local/cuda/include
sudo cp -P cudnn-*-archive/lib/libcudnn* /usr/local/cuda/lib64
sudo chmod a+r /usr/local/cuda/include/cudnn*.h /usr/local/cuda/lib64/libcudnn*
- Edit the environment variables.
export PATH="/usr/local/cuda/bin:$PATH"
export LD_LIBRARY_PATH="/usr/local/cuda/lib64:$LD_LIBRARY_PATH"
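After copying the headers, the installed cuDNN version can be read back from `cudnn_version.h`, which in cuDNN 8 defines `CUDNN_MAJOR`, `CUDNN_MINOR` and `CUDNN_PATCHLEVEL`. The sketch below assumes the header was copied to /usr/local/cuda/include as in the steps above; `parse_cudnn_version` is a hypothetical helper.

```python
import os
import re


def parse_cudnn_version(header_text):
    """Extract (major, minor, patch) from the text of cudnn_version.h, or None."""
    parts = []
    for key in ("CUDNN_MAJOR", "CUDNN_MINOR", "CUDNN_PATCHLEVEL"):
        match = re.search(r"^\s*#define\s+" + key + r"\s+(\d+)",
                          header_text, re.MULTILINE)
        if match is None:
            return None
        parts.append(int(match.group(1)))
    return tuple(parts)


if __name__ == "__main__":
    # Assumed location, following the copy step above.
    header = "/usr/local/cuda/include/cudnn_version.h"
    if os.path.isfile(header):
        with open(header) as handle:
            print("cuDNN version:", parse_cudnn_version(handle.read()))
    else:
        print("header not found:", header)
```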
conda create -n mxnet180cu102 python=3.8
conda activate mxnet180cu102
On Linux
conda install -c anaconda cudatoolkit=10.2 cudnn=7.6.5
conda install -c conda-forge libcblas # important
pip install mxnet-cu102==1.8.0
conda install -c conda-forge nccl
python -m pip uninstall numpy
python -m pip install numpy==1.23.1
# Verify install:
python -c "import mxnet as mx;print(mx.context.num_gpus())"
pip install opentelemetry-api opentelemetry-sdk opentelemetry-exporter-otlp-proto-grpc grpcio
pip install aenum
pip install torchvision==0.9.0
conda install -c anaconda scipy
pip install chardet
pip install opencv-contrib-python
pip install requests
pip install psycopg
pip install "psycopg[binary]"
pip install Pika
With the configuration and the above bare-minimum installation, you should be ready to test the installation and see how things work.
To run inference using the default DNN model alexnet with the default test input images:
python run_image_classification.py
We now discuss how to install a few external services that make the agent fully useful in terms of collecting tracing and profiling data.
MLModelScope relies on a few external services. These services provide tracing functionality.
pip install opentelemetry-api
pip install opentelemetry-sdk
pip install opentelemetry-exporter-otlp-proto-grpc
pip install grpcio
The Jaeger trace server is required. Start it with Docker:
docker run -d --name jaeger -e COLLECTOR_ZIPKIN_HOST_PORT=:9411 -e COLLECTOR_OTLP_ENABLED=true -p 6831:6831/udp -p 6832:6832/udp -p 5778:5778 -p 16686:16686 -p 4317:4317 -p 4318:4318 -p 14250:14250 -p 14268:14268 -p 14269:14269 -p 9411:9411 jaegertracing/all-in-one:1.44
The trace server runs on http://localhost:16686
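Before running with tracing enabled, it can help to confirm that the container is accepting connections. The following is a minimal sketch using only the standard library; `port_is_open` is a hypothetical helper, and the two ports come from the `docker run` mapping above (16686 for the web UI, 4317 for OTLP over gRPC).

```python
import socket


def port_is_open(host, port, timeout=1.0):
    """Return True if a TCP connection to (host, port) succeeds within `timeout`."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        return False


if __name__ == "__main__":
    for label, port in [("Jaeger UI", 16686), ("OTLP gRPC", 4317)]:
        state = "reachable" if port_is_open("localhost", port) else "not reachable"
        print(label, "on port", port, "is", state)
```

A reachable port shows only that something is listening there; opening http://localhost:16686 in a browser confirms that it is the Jaeger UI.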
Example runs:
python run_image_classification.py --task image_classification --agent pytorch --model_name alexnet --architecture gpu --num_warmup 2 --dataset_name test --dataset_path ./test_data --batch_size 2
python run_image_classification.py --task image_classification --agent pytorch --model_name alexnet --architecture gpu --num_warmup 2 --dataset_name test --batch_size 2 --gpu_trace false
python run_image_object_detection.py --task image_object_detection --agent pytorch --model_name mobilenet_ssd_v1_0 --architecture cpu --num_warmup 2 --dataset_name test --batch_size 1 --gpu_trace false
python run_image_semantic_segmentation.py --task image_semantic_segmentation --agent tensorflow --model_name deeplabv3_mobilenet_v2_dm_05_pascal_voc_train_aug --architecture cpu --num_warmup 2 --dataset_name test_cv2 --batch_size 1 --gpu_trace false
python run_image_enhancement.py --task image_enhancement --agent pytorch --model_name srgan --architecture cpu --num_warmup 2 --dataset_name test --batch_size 1 --gpu_trace false
python run_image_instance_segmentation.py --task image_instance_segmentation --agent tensorflow --model_name mask_rcnn_inception_v2_coco --architecture cpu --num_warmup 2 --dataset_name test_cv2 --batch_size 1 --gpu_trace false
python run_image_instance_segmentation_raw.py --task image_instance_segmentation_raw --agent tensorflow --model_name mask_rcnn_inception_v2_coco_raw --architecture cpu --num_warmup 2 --dataset_name test_cv2 --batch_size 1 --gpu_trace false
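When the same experiment has to be repeated with different settings, the command lines above can be generated rather than typed. This is a sketch only: `build_command` is a hypothetical helper, and it uses only the script name and flags that appear in the examples above.

```python
import shlex
import sys


def build_command(script, **flags):
    """Turn keyword arguments into the `--flag value` form used in the examples."""
    command = [sys.executable, script]
    for name, value in flags.items():
        if isinstance(value, bool):
            value = str(value).lower()  # the examples write booleans as `false`
        command += ["--" + name, str(value)]
    return command


if __name__ == "__main__":
    # Sweep the batch size for the alexnet example; nothing is executed here.
    for batch_size in (1, 2, 4):
        command = build_command(
            "run_image_classification.py",
            task="image_classification",
            agent="pytorch",
            model_name="alexnet",
            architecture="gpu",
            num_warmup=2,
            dataset_name="test",
            batch_size=batch_size,
            gpu_trace=False,
        )
        print(" ".join(shlex.quote(part) for part in command))
```

Each returned list can be passed to `subprocess.run(command, check=True)` to launch the run from the directory that contains the scripts.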